Preface


The 10th International Conference on Educational Data Mining (EDM 2017) is held under the auspices of the 
International Educational Data Mining Society at the Optics Valley Kingdom Plaza Hotel, Wuhan, Hubei Province,
China. The conference, held June 25th - June 28th, 2017, follows the nine previous editions (Raleigh 2016, Madrid
2015, London 2014, Memphis 2013, Chania 2012, Eindhoven 2011, Pittsburgh 2010, Cordoba 2009 and Montréal
2008). 

The EDM conference is the leading international forum for high-quality research that leverages educational data, 
learning analytics, and machine learning to answer research questions that shed light on the learning processes. 
Educational data may come from traces that students leave when they interact with learning management systems, 
interactive learning environments, intelligent tutoring systems, educational games or when they participate in other 
data-rich learning contexts. The types of data range from raw log files to data captured by eye-tracking devices or 
other kinds of sensors. The methods used by EDM researchers include analytics, data science, data mining, machine
learning, as well as social network analysis, graph mining, recommender systems, and model building. 

This year's conference features two invited talks, by Dr. Jie Tang, Associate Professor with the Department
of Computer Science and Technology at Tsinghua University, and Dr. Ron Cole, President of Boulder Learning
Inc. Together with the Journal of Educational Data Mining (JEDM), the EDM 2017 conference supports a JEDM 
Track that provides researchers with a venue to deliver more substantial, mature work than is possible in conference
proceedings and to present their work to a live audience. The papers submitted to this track followed the JEDM peer
review process; five papers have been accepted to the track and will be presented at the conference. The abstracts
for the invited talks and accepted JEDM Track papers can be found in the proceedings.

The main conference invited contributions to the Research Track and Industry Track. We received 122 submissions 
(71 full, 47 short, 4 industry). We accepted 18 full papers (25% acceptance rate) and 32 short papers for oral
presentation (42% acceptance rate), along with an additional 39 poster presentations and 3 demonstrations. The industry
track includes all 4 submitted industry papers and 1 paper initially submitted as a full paper.

The EDM conference provides opportunities for young researchers, and particularly Ph.D. students, to present 
their research ideas and receive feedback from peers and more senior researchers. This year, the Doctoral
Consortium features 6 such presentations. In addition to the main program, the conference includes 3 workshops
(Graph-based Educational Data Mining (G-EDM 2017); Sharing and Reusing Data & Analytics Methods with
LearnSphere; and Deep Learning with Educational Data) and 2 tutorials (Why Data Standards are Critical for EDM and
AIED; and Principal Stratification for EDM Experiments).

We thank the sponsors of EDM 2017 for their generous support: 17Zuoye, Coursera, Learnta, and the Prof. 
Ram Kumar Memorial Foundation. We also thank the program committee members and reviewers, who with their 
enthusiastic contributions gave us invaluable support in putting this conference together. Last but not least we thank 
the organizing team. 
Awards


Best papers and exemplary paper selection 


The two program chairs selected 5 best paper nominees based on the reviews and meta-reviews for each
paper. The nominees were then sent to the members of the best paper awards committee. Each committee member
read and ranked each of the nominees. The rankings were compiled and the best paper award was attributed to the
most highly ranked paper. The best student paper award was attributed to the most highly ranked paper that was
also authored by a student. The winner of the best paper award was not eligible to also win the best student paper 
award. 


Best paper/best student papers committee: 


Ryan Baker Michel Desmarais Zach Pardos 
Cristobal Romero Danielle McNamara Didith Rodrigo 
Best student paper Generalizability of Face-Based Mind Wandering Detection Across Task Contexts
Angela Stewart, Nigel Bosch and Sidney D’Mello


Best paper nominees Zone out no more: Mitigating mind wandering during computerized reading 
Can AI help MOOCs?


Jie Tang 
Department of Computer 
Science and Technology at 
Tsinghua University 


jietang@tsinghua.edu.cn 


ABSTRACT 


Massive open online courses (MOOCs) have boomed in recent
years and have attracted millions of users worldwide. They are
not only transforming higher education but also providing
fodder for scientific research. In this talk, I will first
introduce the major MOOC platforms in China. For example,
XuetangX.com, a platform similar to Coursera and edX,
offers thousands of courses to more than 7,000,000 registered
users. I will also introduce how we leverage AI technologies
to help enhance student engagement on MOOCs.
The evolution of virtual tutors, clinicians, and companions:
A 20-year perspective on conversational agents in 
real-world applications 


Ronald Cole 


Boulder Learning Inc. 
rcole@boulderlearning.com 


ABSTRACT 

The talk will present an overview of research projects initiated
in 1997 and continuing today in 2017, in which 3-D
computer characters interact with children and adults with
the aim of improving their language communication skills,
educational achievement, and/or personal well-being. The
talk examines how advances in human language and character
animation technologies, and research leading to a deeper
understanding of how to apply these technologies to optimize
engagement and learning, led to positive experiences
and learning outcomes similar to those achieved by experienced
teachers and clinicians, for individuals from 5 to 80 years of age.
The talk concludes with a consideration of how recent advances in
machine learning algorithms, coupled with cloud-based delivery
of automated assessment and instruction, delivered by
virtual agents, can save teachers millions of hours of time
annually, and provide EDM researchers with vast amounts
of speech and language data that can be mined to improve
students’ learning experiences and outcomes.
ABSTRACT

The three “unidentified” model specifications proposed by
Beck and Chang (2007) are identified by the Bayesian Knowledge
Tracing model with a non-informative Dirichlet prior
distribution and an observed sequence that is longer than
three periods. Although these specifications have the same
observed learning curve, they generate different likelihoods
given the same data. The paper further shows that the observed
learning curve is not a sufficient statistic of the
data generating process stipulated by the Bayesian Knowledge
Tracing model. Therefore, it cannot be used in parameter
inference of the Bayesian Knowledge Tracing model.
ABSTRACT


Various forms of Peer-Learning Environments are increas- 
ingly being used in post-secondary education, often to help 
build repositories of student generated learning objects. How- 
ever, large classes can result in an extensive repository, which 
can make it more challenging for students to search for suit- 
able objects that both reflect their interests and address 
their knowledge gaps. Recommender Systems for Technology
Enhanced Learning (RecSysTEL) offer a potential solution
to this problem by providing sophisticated filtering
techniques to help students to find the resources that they 
need in a timely manner. Here, a new RecSysTEL for Rec- 
ommendation in Peer-Learning Environments (RiPLE) is 
presented. The approach uses a collaborative filtering algo- 
rithm based upon matrix factorization to create personalized 
recommendations for individual students that address their 
interests and their current knowledge gaps. The approach
is validated using both synthetic and real data sets. The 
results are promising, indicating RiPLE is able to provide 
sensible personalized recommendations for both regular and 
cold-start users under reasonable assumptions about param- 
eters and user behavior. 


Keywords 
Peer-Learning Environments, Recommender Systems, Knowl- 
edge Gaps 
ABSTRACT


Research on non-cognitive factors has shown that persistence in 
the face of challenges plays an important role in learning. 
However, recent work on wheel-spinning, a type of unproductive 
persistence where students spend too much time struggling 
without achieving mastery of skills, shows that not all persistence
is uniformly beneficial for learning. For this reason, it becomes 
increasingly pertinent to identify the key differences between 
unproductive and productive persistence toward informing 
interventions in computer-based learning environments. In this 
study, we attempt to address this by using classification models to 
distinguish between productive persistence and wheel-spinning in 
ASSISTments, an online math learning platform. Our results 
indicate that wheel-spinning is associated with shorter delays 
between solving problems of the same skill, more attempts to 
answer problems, and the heavy use of bottom-out hints except for
the first problem. These findings suggest that encouraging 
students to engage in spaced practice and avoid over-using 
bottom-out hints is likely helpful to reduce their wheel-spinning 
and improve learning. These findings also provide insight on 
which students are struggling and how to make students’ 
persistence more productive. 
ABSTRACT


Massive open online courses (MOOCs) provide educators 
with an abundance of data describing how students interact 
with the platform, but this data is highly underutilized to- 
day. This is in part due to the lack of sophisticated tools 
to provide interpretable and actionable summaries of huge 
amounts of MOOC activity present in log data. To ad- 
dress this problem, we propose a student behavior repre- 
sentation method alongside a method for automatically dis- 
covering those student behavior patterns by leveraging the 
click log data that can be obtained from the MOOC plat- 
form itself. Specifically, we propose the use of a two-layer 
hidden Markov model (2L-HMM) to extract our desired be- 
havior representation, and show that patterns extracted by 
such a 2L-HMM are interpretable, meaningful, and unique. 
We demonstrate that features extracted from a trained 2L- 
HMM can be shown to correlate with educational outcomes. 
ABSTRACT


As the use of educational technology becomes more ubiquitous, 
an enormous amount of learning process data is being produced. 
Educational data mining seeks to analyze and model these data, 
with the ultimate goal of improving learning outcomes. The most 
firmly grounded and rigorous evaluation of an educational data 
mining discovery is whether it yields better student learning when 
applied. Such an evaluation has been referred to as "closing the 
loop", as it completes the cycle of system design, deployment, data
analysis, and discovery leading back to design. Here, we present 
an instance of “closing the loop” on an automated cognitive 
modeling improvement discovered by Learning Factors Analysis 
(Cen, Koedinger, & Junker, 2006). We discuss our findings from 
a process in which we interpret the automated improvements 
yielded by the best-fitting cognitive model, validate the 
interpretation on novel data, use it to make changes to classroom- 
deployed educational technology, and show that the changes lead 
to significant learning gains relative to a control condition. 
Zone out no more: Mitigating mind wandering during
computerized reading

Sidney K. D’Mello, Caitlin Mills, Robert Bixler, & Nigel Bosch


University of Notre Dame 
118 Haggar Hall 
Notre Dame, IN 46556, USA 
sdmello@nd.edu 


ABSTRACT 


Mind wandering, defined as shifts in attention from task-related 
processing to task-unrelated thoughts, is a ubiquitous
phenomenon that has a negative influence on performance and 
productivity in many contexts, including learning. We propose 
that next-generation learning technologies should have some 
mechanism to detect and respond to mind wandering in real-time. 
Towards this end, we developed a technology that automatically 
detects mind wandering from eye-gaze during learning from 
instructional texts. When mind wandering is detected, the 
technology intervenes by posing just-in-time questions and 
encouraging re-reading as needed. After multiple rounds of 
iterative refinement, we summatively compared the technology to 
a yoked-control in an experiment with 104 participants. The key 
dependent variable was performance on a post-reading
comprehension assessment. Our results suggest that the 
technology was successful in correcting comprehension deficits 
attributed to mind wandering (d = .47 sigma) under specific 
conditions, thereby highlighting the potential to improve learning 
by “attending to attention.” 


Keywords 
Mind wandering; gaze tracking; student modeling; attention- 
aware. 


1. INTRODUCTION


Despite our best efforts to write a clear and engaging paper, 
chances are high that within the next 10 pages you might fall prey 
to what is referred to as zoning out, daydreaming, or mind 
wandering [45]. Despite your best intention to concentrate on our 
paper, at some point your attention might drift away to unrelated 
thoughts of lunch, childcare, or an upcoming trip. This prediction 
is not based on some negative or cynical opinion of the 
reader/reviewer (we read and review papers too), but on what is 
known about attentional control, vigilance, and concentration 
while individuals are engaged in complex comprehension 
activities, such as reading for understanding. 


One recent study tracked mind wandering of 5,000 individuals 
from 83 countries with a smartphone app that prompted people 
with thought-probes at random intervals throughout the day [24]. 
People reported mind wandering for 46.9% of the prompts, which 
confirmed lab studies on the pervasiveness of mind wandering 
(see [45] for a review). Mind wandering is more than merely 
incidental; a recent meta-analysis of 88 samples indicated a 
negative correlation between mind wandering and performance 
across a variety of tasks [34], a correlation which increases with 
task complexity. When compounded with its high frequency, 
mind wandering can have serious consequences on the
performance and productivity of society at large. 


Mind wandering is also unfortunately an under-addressed 
problem in education and is yet to be deeply studied in the context
of learning with technology. Traditional learning technologies
rely on the assumption that students are attending to the learning 
session, although this is not always the case. For example, it has 
been estimated that students mind wander approximately 40% of 
the time when engaging with online lectures [38], which are an 
important component of MOOCs. Some advanced technologies 
do aim to detect and respond to affective states like boredom, but 
evidence for their effectiveness is still equivocal (see [9] for a 
review). Further, boredom is related to but not the same as 
attention [12]. There are technologies that aim to prevent mind 
wandering by engendering a highly immersive learning 
experience and have achieved some success in this regard [40, 
41]. But what is to be done when attentional focus inevitably 
wanes as the session progresses and the novelty of the system and 
content fades? 


Our central thesis is that next-generation learning technologies 
should include mechanisms to model and respond to learners’ 
attention in real-time [8]. Such attention-aware technologies can 
model various aspects of learner attention (e.g., divided attention, 
alternating attention). Here, we focus on detecting and mitigating 
mind wandering, a quintessential signal of waning engagement. 
We situate our work in the context of reading because reading is 
a common activity shared across multiple learning technologies, 
thereby increasing the generalizability of our results. Further, 
students mind wander approximately 30% of the time during 
computerized reading [44]. And although mind wandering can 
facilitate certain cognitive processes like future planning and 
divergent thinking [2, 28], it negatively correlates with 
comprehension and learning (reviewed in [31, 45]), suggesting 
that it is important to address mind wandering during learning. 


Towards this end, we developed and validated a closed-loop 
attention-aware learning technology that combines a machine- 
learned mind wandering detector with a real-time interpolated 
testing and re-study intervention. Our attention-aware technology 
works as follows. Learners read a text on a computer screen using 
a self-paced screen-by-screen (also called page-by-page) reading 
paradigm. We track eye-gaze during reading using a remote eye 
tracker that does not restrict head movements. We focus on eye- 
gaze for mind wandering detection due to decades of research 
suggesting a tight coupling between attentional focus and eye 
movements during reading [36]. When mind wandering is 
detected, the system intervenes in an attempt to redirect 
attentional focus and correct any comprehension deficits that 
might arise due to mind wandering. The interventions consist of 
asking comprehension questions on pages where mind wandering
was detected and providing opportunities to re-read based on 
learners’ responses. In this paper, we discuss the mind wandering
detector, intervention approach, and results of a summative
evaluation study¹.

Proceedings of the 10th International Conference on Educational Data Mining 8


1.1 Related Work 


The idea of attention-aware user interfaces is not new, but was 
proposed almost a decade ago by Roda and Thomas [39]. There 
was even an article on futuristic applications of attention-aware 
systems in educational contexts [35]. Prior to this, Gluck, et al. 
[15] discussed the use of eye tracking to increase the bandwidth 
of information available to an intelligent tutoring system (ITS). 
Similarly, Anderson [1] followed up on some of these ideas by 
demonstrating how particular beneficial instructional strategies 
could only be launched via a real-time analysis of eye gaze. 


Most of the recent work has been on leveraging eye gaze to 
increase the bandwidth of learner models [22, 23, 29]. Conati, et 
al. [5] provide an excellent review of much of the existing work 
in this area. We can group the research into three categories: (1) 
offline-analyses of eye gaze to study attentional processes, (2) 
computational modeling of attentional states, and (3) closed-loop 
systems that respond to attention in real-time. Offline-analysis of 
eye movements has received considerable attention in cognitive 
and educational psychology for several decades [e.g., 16, 19], so 
this area of research is relatively healthy. Online computational 
models of learner attention are just beginning to emerge [e.g., 6, 
11], while closed-loop attention-aware systems are few and far 
between (see [7, 15, 42, 48] for a more or less exhaustive list). 
Two known examples, GazeTutor and AttentiveReview, are 
discussed below. 


GazeTutor [7] is a learning technology for biology. It has an 
animated conversational agent that provides spoken explanations 
on biology topics which are synchronized with images. The 
system uses a Tobii T60 eye tracker to detect inattention, which 
is assumed to occur when learners’ gaze is not on the tutor agent 
or image for at least five consecutive seconds. When this occurs, 
the system interrupts its speech mid-utterance, directs learners to
reorient their attention (e.g., “I’m over here you know”), and 
repeats speaking from the start of the current utterance. In an 
evaluation study, 48 learners (undergraduate students) completed 
a learning session on four biology topics with the attention-aware 
components enabled (experimental group) or disabled (control 
group). The results indicated that GazeTutor was successful in 
dynamically reorienting learners’ attentional patterns towards the 
interface. Importantly, learning gains for deep reasoning 
questions were significantly higher for the experimental vs. 
control group, but only for high aptitude learners. The results 
suggest that even the most basic attention-aware technology can 
be effective in improving learning, at least for a subset of learners. 
However, a key limitation is that the researchers simply assumed 
that off-screen gaze corresponded to inattention, but did not test 
this assumption (e.g., students could have been concentrating 
with their eyes closed and this would have been perceived as 
being inattentive). 
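
The inattention rule described above (gaze away from the tutor agent or image for at least five consecutive seconds) can be sketched as follows. This is a minimal illustration, not GazeTutor's implementation: the sample format (a timestamp in seconds plus an on-target flag) and the function name are assumptions made for the example.

```python
from typing import Iterable, List, Tuple

OFF_TARGET_LIMIT_S = 5.0  # threshold stated in the text: five consecutive seconds


def inattention_onsets(samples: Iterable[Tuple[float, bool]],
                       limit_s: float = OFF_TARGET_LIMIT_S) -> List[float]:
    """Return the times at which gaze has been off target for limit_s seconds.

    samples: (timestamp_seconds, on_target) pairs in chronological order.
    This input format is hypothetical and chosen only for the sketch.
    """
    onsets: List[float] = []
    off_since = None   # start time of the current off-target run
    flagged = False    # whether the current run has already been reported
    for t, on_target in samples:
        if on_target:
            # Looking back at the agent or image resets the run
            off_since, flagged = None, False
            continue
        if off_since is None:
            off_since = t
        if not flagged and t - off_since >= limit_s:
            onsets.append(t)
            flagged = True
    return onsets
```

Each off-target run is reported at most once, which mirrors a system that interrupts its speech a single time per lapse. As the text notes, the rule equates off-target gaze with inattention without testing that assumption.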


AttentiveReview [32] is a closed-loop system for MOOC learning
on mobile phones. The system uses video-based
photoplethysmography (PPG) to detect a learner’s heart rate from
the back camera of a smartphone while they view MOOC-like
lectures on the phone. AttentiveReview ranks the lectures based
on its estimates of learners’ “perceived difficulty,” selecting the
most difficult lecture for subsequent review (called adaptive
review). In a 32-participant between-subjects evaluation study,
the authors found that learning gains obtained from the adaptive
review condition were statistically on par with a full review
condition, but were achieved in 66.7% less review time. Although
this result suggests that AttentiveReview increased learning
efficiency, there is the question as to whether the system should
even be considered to be an “attention-aware” technology. This is
because it is arguable whether the system has anything to do with
attention (except for “attention” appearing in its name) as it
selects items for review based on a model of “perceived
difficulty” and not on learners’ “attentional state.” The two might
be related, but are clearly not the same.


¹ This paper reports updated results of an earlier version [10] presented
as a “Late-Breaking Work” (LBW) poster at the 2016 ACM CHI
conference. LBW “Extended Abstracts” are not included in the main
conference proceedings and copyright is retained by the authors.


1.2 Novelty 

Our paper focuses on closing the loop between research on 
educational data and learning outcomes by developing and 
validating the first (in our view) real-time learning technology 
that detects and mitigates mind wandering during computerized 
reading. Although automated detection of complex mental states 
with the goal of developing intelligent learning technologies that 
respond to the sensed states is an active research area (see reviews
by [9, 18]), mind wandering has rarely been explored as an aspect 
of a learner’s mental state that warrants detection and corrective 
action. And while there has been some work on modeling the 
locus of learner attention (see review by [5]), mind wandering is 
inherently different than more commonly studied forms of 
attention (e.g., selective attention, distraction), because it involves 
more covert forms of involuntary attentional lapses spawned by 
self-generated internal thought [45]. Simply put, mind wandering 
is a form of “looking without seeing” because the eyes might be 
fixated on the appropriate external stimulus, but very little is 
being processed as the mind is consumed by stimulus- 
independent internal thoughts. Offline automated approaches to 
detect mind wandering have been developed (e.g., [3, 11, 27, 33]), 
but these detectors have not yet been used to trigger online 
interventions. Here, we adapt an offline gaze-based automated 
mind wandering detector [13] to trigger real-time interventions to 
address mind wandering during reading. We conduct a 
randomized controlled trial to evaluate the efficacy of our
attention-aware learning technology in improving learning.


2. MIND WANDERING DETECTION 


We adopted a supervised learning approach for mind wandering 
detection. Below we provide a high-level overview of the 
approach; readers are directed to [3, 13] for a detailed discussion 
of the general approach used to build gaze-based detectors of 
mind wandering. 


2.1 Training Data 

We obtained training data from a previous study [26] that 
involved 98 undergraduate students reading a 57-page text on the 
surface tension of liquids [4] on a computer screen for an average 
of 28 minutes. The text contained around 6500 words, with an 
average of 115 words per page, and was displayed on a computer 
screen with Courier New typeface. We recorded eye-gaze with a 
Tobii TX300 eye tracker set to a sampling frequency of 120 Hz. 
Participants could read normally and were free to move or gesture
as they pleased. 


Participants were instructed to report mind wandering (during 
reading) by pressing a predetermined key when they found 
themselves “thinking about the task itself but not the actual 
content of the text” or when they were “thinking about anything 
else besides the task.” This is consistent with contemporary
approaches (see [45]) that rely on self-reporting because mind
wandering is an internal conscious phenomenon. Further,
self-reports of mind wandering have been linked to predictable
patterns in physiology [43], pupillometry [14], eye-gaze [37], and 
task performance [34], providing validity for this approach. 


On average, we received mind wandering reports for 32% of the 
pages (SD = 20%), although there was considerable variability 
among participants (ranging from 0% to 82%). Self-reported 
mind wandering negatively correlated (r = -.23, p < .05) with 
scores on a subsequent comprehension assessment [26], which 
provides evidence for the predictive validity of the self-reports. 


2.2 Model Building 


The stream of eye-gaze data was filtered to produce a series of 
fixations, saccades, and blinks, from which global eye gaze 
features were extracted (see Figure 1). Global features are 
independent of the words being read and are therefore more 
generalizable than so-called local features. A full list of 62 global 
features along with detailed descriptions is provided in [13], but 
briefly the features can be grouped into the following four 
categories: (1) Eye movement descriptive features (n = 48) were
statistical functionals (e.g., min, median) for fixation duration,
saccade duration, saccade amplitude, saccade velocity, and
relative and absolute saccade angle distributions; (2) Pupil
diameter descriptive features (n = 8) were statistical functionals
computed from participant-level z-score standardized estimates
of pupil diameter; (3) Blink features (n = 2) consisted of the 
number of blinks and the mean blink duration; (4) Miscellaneous 
gaze features (n = 4) consisted of the number of saccades, 
horizontal saccade proportion, fixation dispersion, and the 
fixation duration/saccade duration ratio. We proceeded with a 
subset of 32 features after eliminating features exhibiting 
multicollinearity. 
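
To make the notion of a statistical functional concrete, the sketch below summarizes one gaze measure and standardizes pupil diameter within a participant. It is only an illustration: the full feature definitions are given in [13], and the particular functionals, the use of the population standard deviation, the function names, and the example values are all assumptions.

```python
import statistics
from typing import Dict, List, Sequence


def functionals(values: Sequence[float], prefix: str) -> Dict[str, float]:
    """Summarize one gaze measure with a few statistical functionals.

    The exact functionals are listed in [13]; min, max, mean, median, and
    standard deviation are used here only as illustrative stand-ins.
    """
    return {
        f"{prefix}_min": min(values),
        f"{prefix}_max": max(values),
        f"{prefix}_mean": statistics.fmean(values),
        f"{prefix}_median": statistics.median(values),
        f"{prefix}_sd": statistics.pstdev(values),
    }


def zscore_within_participant(values: Sequence[float]) -> List[float]:
    """Standardize values (e.g., pupil diameter) by one participant's own mean and SD."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]


# Hypothetical fixation durations (ms) from a single window
features = functionals([180.0, 220.0, 260.0, 300.0], "fix_dur")
```

Because none of these summaries refers to the words on the page, the same code applies unchanged to any text, which is the sense in which global features are more generalizable than local ones.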


Features were calculated from only a certain amount of gaze data 
from each page, called the window. The end of the window was 
positioned 3 seconds before a self-report so as to not overlap with 
the key-press. The average amount of time between self-reports 
and the beginning of the page was 16 seconds. We used this time 
point as the end of the window for pages with no self-report. 
Pages that were shorter than the target window size were 
discarded, as were pages with windows that contained fewer than 
five gaze fixations as there was insufficient data to compute some 
of the features. There were a total of 4,225 windows with 
sufficient data for supervised classification. 
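
The windowing rules above can be sketched as a small helper. The constants come from the text; the function names and the precise discard rule (a page is dropped when a full window does not fit between page onset and the window end) are our reading of the description and should be treated as assumptions.

```python
from typing import Optional, Sequence, Tuple

REPORT_GAP_S = 3.0     # window ends 3 s before a self-report (from the text)
DEFAULT_END_S = 16.0   # window end for pages without a self-report (from the text)
MIN_FIXATIONS = 5      # windows with fewer fixations are discarded (from the text)


def window_bounds(page_duration_s: float,
                  report_time_s: Optional[float],
                  window_size_s: float) -> Optional[Tuple[float, float]]:
    """Return (start, end) in seconds from page onset, or None if the page is discarded."""
    if report_time_s is None:
        end = DEFAULT_END_S
    else:
        end = report_time_s - REPORT_GAP_S
    start = end - window_size_s
    # Assumed discard rule: the whole window must lie within the time spent on the page
    if start < 0 or end > page_duration_s:
        return None
    return (start, end)


def has_enough_fixations(fixation_times_s: Sequence[float],
                         bounds: Tuple[float, float]) -> bool:
    """Check that the window holds at least MIN_FIXATIONS fixations."""
    start, end = bounds
    return sum(start <= t <= end for t in fixation_times_s) >= MIN_FIXATIONS
```

For example, with an 8-second window a page with a self-report at 20 s yields the window (9.0, 17.0), while a page without a report yields (8.0, 16.0).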


We experimented with a number of supervised classifiers on 
window sizes of 4, 8, and 12 seconds to discriminate positive 
(pages with a self-report = 32%) from negative (pages without a 
self-report) instances of mind wandering. The training data were 
downsampled to achieve a 50% base rate; testing data were 
unaltered. A leave-one-participant-out validation approach was 
adopted where models were built on data from n-1 participants 
and evaluated on the held-out participant. The process was 
repeated for all participants. Model validation was conducted in a 
way to simulate a real-time system by analyzing data from every 
page. When classification was not possible due to a lack of valid 
gaze data and/or because participants did not spend enough time
on the page, we classified the page as a positive instance of mind
wandering. This was done because analyses indicated that 
participants were more likely to be mind wandering in those cases 
(but see [13] for alternate strategies to handle missing instances). 
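
The validation scheme described above (leave-one-participant-out, with downsampling applied to the training data only) can be sketched as follows. This is a minimal illustration using the standard library; the instance format, function names, and seed are assumptions, and the classifier itself is omitted.

```python
import random
from typing import Iterator, List, Tuple

# (participant_id, features, label); label 1 = mind wandering, 0 = not
Instance = Tuple[str, List[float], int]


def downsample_to_balance(train: List[Instance], rng: random.Random) -> List[Instance]:
    """Randomly drop majority-class instances so both classes are equally frequent."""
    pos = [x for x in train if x[2] == 1]
    neg = [x for x in train if x[2] == 0]
    n = min(len(pos), len(neg))
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced


def leave_one_participant_out(data: List[Instance], seed: int = 0
                              ) -> Iterator[Tuple[str, List[Instance], List[Instance]]]:
    """Yield (held_out_id, balanced_train, untouched_test) for every participant."""
    rng = random.Random(seed)
    for held_out in sorted({pid for pid, _, _ in data}):
        train = [x for x in data if x[0] != held_out]
        # Test data keep their natural base rate, as in the text
        test = [x for x in data if x[0] == held_out]
        yield held_out, downsample_to_balance(train, rng), test
```

Splitting by participant rather than by window keeps all of a person's data on one side of the split, so the reported accuracy reflects performance on people the model has never seen.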
Figure 1: Gaze fixations during mind wandering (top)
and normal reading (bottom) 


2.3 Detector Accuracy 

The best model was a support vector machine that used global 
features and operated on a window size of 8 seconds. The area
under the ROC curve (AUC or AUROC or A’) was .66, which 
exceeds the 0.5 chance threshold [17]. 


We classified each instance as mind wandering or not mind
wandering based on whether the detector’s predicted likelihood
of mind wandering (ranging from 0 to 1) was above or below 0.5.
We adopted the default 0.5 threshold as it led to a higher rate of
true positives while maintaining a moderate rate of true negatives.
This resulted in the confusion matrix shown in Table 1.
The model had a weighted precision of 72.2% and a weighted 
recall of 67.4%, which we deemed to be sufficiently accurate for 
intervention. 


Table 1: Proportionalized confusion matrix for mind
wandering detection


               Predicted mind wandering (MW)
Actual MW      yes                       no
yes            0.715 (hit)               0.285 (miss)
no             0.346 (false positive)    0.654 (correct rejection)
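
As a worked check, the weighted precision and recall can be recomputed from the row proportions in Table 1. The sketch assumes that the weights are the class frequencies and that the 32% self-report rate is the base rate of mind wandering in the test data; neither is stated explicitly alongside the table, and the function name is ours.

```python
from typing import Tuple


def weighted_metrics(hit_rate: float, correct_rejection_rate: float,
                     base_rate: float) -> Tuple[float, float]:
    """Class-frequency-weighted precision and recall from row-normalized rates.

    base_rate is the proportion of instances that are actual mind wandering.
    """
    # Convert row proportions into joint proportions over all instances
    tp = base_rate * hit_rate
    fn = base_rate * (1 - hit_rate)
    tn = (1 - base_rate) * correct_rejection_rate
    fp = (1 - base_rate) * (1 - correct_rejection_rate)

    precision_mw = tp / (tp + fp)
    precision_not_mw = tn / (tn + fn)

    w_precision = base_rate * precision_mw + (1 - base_rate) * precision_not_mw
    w_recall = base_rate * hit_rate + (1 - base_rate) * correct_rejection_rate
    return w_precision, w_recall


# Values from Table 1, with the 32% self-report rate assumed as the base rate
w_precision, w_recall = weighted_metrics(0.715, 0.654, 0.32)
print(f"weighted precision = {w_precision:.3f}, weighted recall = {w_recall:.3f}")
```

Under these assumptions the sketch gives approximately 0.722 and 0.674, in line with the reported 72.2% and 67.4%. It also shows that precision for the mind wandering class alone is only about 0.49, so roughly half of the triggered interventions would land on pages without a self-report.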


3. INTERVENTION TO ADDRESS MIND WANDERING

Our intervention approach is grounded in the basic idea that 
learning of conceptual information involves creating and 
maintaining an internal model (mental model) by integrating 
information from the text with prior knowledge from memory 
[25]. This integration process relies on attentional focus and 
breaks down during mind wandering because information from 
the external environment is no longer being integrated into the 
internal mental model. This results in an impaired model which 
leads to less effective suppression of off-task thoughts. This 
increase in mind wandering further impairs the mental model, 
resulting in a vicious cycle. Our intervention targets this vicious
cycle by redirecting attention to the primary task and attempting 
to correct for comprehension deficits attributed to mind 
wandering. Based on research demonstrating the effectiveness of 
interpolated testing [47], we propose that asking questions on
pages where mind wandering is detected and encouraging
re-reading in response to incorrect responses will aid in redirecting
attention to the text and correcting knowledge deficits.


3.1 Intervention Implementation 

Our initial intervention was implemented for the same text used 
to create the mind wandering detector (although it could be 
applied to any text). The text was integrated into the computer 
reading interface. Mind wandering detection occurred when the 
learner navigated to the next page using the right arrow key. In 
order to address ambiguity in mind wandering detection, we used 
the detector’s mind wandering likelihood to probabilistically 
determine when to intervene. For example, if the mind wandering 
likelihood was 70%, then there was a 70% chance of intervention 
on any given page (all else being equal). We did not intervene for 
the first three pages in order to allow the learner to become 
familiar with the text and interface. To reduce disruption, there 
was a 50% reduced probability of intervening on adjacent pages, 
and the maximum number of interventions was capped at 1/3 x 
the number of pages (19 for the present 57-page text). Table 2 
presents pseudo code for when to launch an intervention. 


Table 2: Pseudo code for intervention strategy 


launch_intervention: 
    if (current_page >= WAITPAGES 
            and total_interventions < MAXINTRV 
            and gaze_likelihood > random(0,1) 
            and (!has_intervened(previous_page) 
                 or 0.5 < random(0,1))): 
        do_intervention() 
    else: 
        show_next_page() 


do_intervention: 
    answer1 = show_question1() 
    if answer1 is correct: 
        show_positive_feedback() 
        show_next_page() 
    else: 
        show_neg_feedback() 
        suggest_rereading() 
        if page advance detected: 
            answer2 = show_question2() 
            show_next_page() 
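
The launch rule in Table 2 can be sketched in Python as follows. This is a minimal sketch rather than the implementation used in the study: the 0-based `page_index`, the constant names, and the injectable `rng` argument (useful for testing) are assumptions introduced here for illustration.

```python
import random

WAIT_PAGES = 3                        # no interventions on the first three pages
TOTAL_PAGES = 57                      # length of the text used in the study
MAX_INTERVENTIONS = TOTAL_PAGES // 3  # cap of 19 for the 57-page text


def should_intervene(page_index, total_interventions, mw_likelihood,
                     intervened_on_previous_page, rng=random):
    """Return True if an intervention should be launched on this page.

    page_index is assumed to be 0-based, so pages 0, 1 and 2 are skipped.
    mw_likelihood is the detector output in [0, 1].
    """
    return (page_index >= WAIT_PAGES
            and total_interventions < MAX_INTERVENTIONS
            # intervene with probability equal to the detector likelihood
            and mw_likelihood > rng.random()
            # halve that probability right after an intervention
            and (not intervened_on_previous_page or 0.5 < rng.random()))
```

Because the conditions are joined with `and`, the random draws are only made once the page and cap checks have passed, which mirrors the order of the checks in Table 2.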


Figure 2 presents an outline of the intervention strategy. The 
intervention itself relied on two multiple choice questions for 
each page (screen) of the text. When the system decided to 
intervene, one of the questions (randomly selected) was presented 
to the learner. If the learner answered this online question 
correctly, positive feedback was provided, and the learner could 
advance to the next page. If the learner answered incorrectly, 
negative feedback was provided, and the system encouraged the 
learner to re-read the page. The learner was then provided with a 
second (randomly selected) online question, which could either 
be the same or the alternate question for that page. Feedback was 
not provided and the learner was allowed to advance to the next 
page regardless of whether the second question was answered 
correctly, so as not to be overly burdensome. 


Figure 2: Outline of intervention strategy 


3.2 Iterative Refinement 

The technology was refined through multiple rounds of formative 
testing with 67 participants, recruited from the same institution 
used to build the detector. Participants were observed while 
interacting with the technology, their responses were analyzed, 
and they were interviewed about their experience. We used the 
feedback gleaned from these tests to refine the intervention 
parameters (i.e., when to launch, how many interventions to 
launch, whether to launch interventions on subsequent pages), 
intervention questions themselves, and instructions on how to 
attend to the intervention. For example, earlier versions of the 
intervention used a fixed threshold (instead of the aforementioned 
probabilistic approach) to trigger an intervention. Despite many 
attempts to set this threshold, the end result was that some 
participants received many interventions while others received 
almost no interventions. This issue was corrected by 
probabilistically rather than deterministically launching the 
intervention. Additional testing/refinement of the comprehension 
questions used in the intervention was done using crowdsourcing 
platforms, specifically Amazon’s Mechanical Turk (MTurk). 


4. Evaluation Study 

We conducted a randomized controlled trial to evaluate the 
technology. The experiment had two conditions: an intervention 
condition and a yoked control condition (as described below). The 
yoked control was needed to verify that any learning benefits are 
attributed to the technology being sensitive to mind wandering 
and not merely to the added opportunities to answer online 
questions and re-read. This is because we know that interpolated 
testing itself has beneficial comprehension effects [47]. 


4.1 Method 


Participants (N = 104) were a new set of undergraduate students 
who participated to fulfill research credit requirements. They 
were recruited from the same university used to build the MW 
detector and for the iterative testing and refinement cycles. 


We did not use a pretest because we expected participants to be 
unfamiliar with the topic. Participants were not informed that the 
interface would be tracking their mind wandering (until the 
debriefing at the end). Instead, they were instructed as follows: 
“While reading the text, you will occasionally be asked some 
questions about the page you just read. Depending on your 
answer, you will re-read the same page and you will be asked 
another question that may or may not be the same question.” 


Participants in the intervention condition received the 
intervention as described above (i.e., based on detected mind 
wandering likelihoods). Each participant in the yoked control 
condition was paired with a participant in the intervention 
condition. He or she received an intervention question on the 
same pages as their paired intervention participant regardless of 
mind wandering likelihood. For example, if participant A (i.e., 
intervention condition) received questions on pages 5, 7, 10, and 
25, participant B (i.e., yoked control condition) would receive 
intervention questions on the same pages. However, if the yoked 
participant answered incorrectly, then (s)he had the opportunity 
to re-read and answer another question regardless of the outcome 
of their intervention-condition partner. 
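
The yoking logic can be illustrated with a short sketch. The set-based representation of the partner's pages, the callable that returns the yoked participant's own answer, and the returned labels are hypothetical and serve only to make the rule concrete.

```python
def yoked_step(page, partner_pages, answer_first_question):
    """Decide what a yoked control participant sees after reading a page.

    partner_pages: pages on which the paired intervention participant
        received a question (hypothetical representation as a set).
    answer_first_question: callable taking the page and returning True if
        the yoked participant answers the first question correctly.
    """
    # Questions appear on the partner's pages, regardless of the yoked
    # participant's own mind wandering likelihood.
    if page not in partner_pages:
        return "next_page"
    # Re-reading depends on the yoked participant's own answer,
    # not on the outcome of the intervention-condition partner.
    if answer_first_question(page):
        return "positive_feedback"
    return "reread_and_second_question"
```

For the example in the text, `yoked_step(7, {5, 7, 10, 25}, ...)` always presents a question on page 7, while page 6 is never interrupted.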


After reading, participants completed a 38-item multiple choice 
comprehension assessment to measure learning. The questions 
were randomly selected from the 57 pages (one per page) with the 
exception that a higher selection priority was given to pages that 
were re-read on account of the intervention. Participants in the 
yoked control condition received the same posttest questions as 
their intervention condition counterparts. 


4.2 Results 


Participants received an average of 16 (min of 7 and max of 19) 
interventions. They spent an average of 27.5 seconds on each 
screen prior to receiving an intervention. There was no significant 
difference across conditions (p = .998), suggesting that reading 
time was not a confound. In what follows, we compared each 
intervention participant to his/her yoked control with a two-tailed 
paired-samples t-test and a 0.05 criterion for statistical 
significance. 
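
As an illustration of the paired comparison, the t statistic for yoked pairs can be computed from the per-pair score differences. The sketch below uses only the standard library and made-up scores; it stops at the t statistic and degrees of freedom, since the two-tailed p value additionally requires the Student t distribution (available, for example, through `scipy.stats.ttest_rel`).

```python
import math
from statistics import mean, stdev


def paired_t(intervention_scores, control_scores):
    """Return (t, degrees_of_freedom) for a paired-samples t-test.

    The two lists must be aligned so that index k holds the scores of
    the k-th intervention participant and his/her yoked control.
    """
    diffs = [x - y for x, y in zip(intervention_scores, control_scores)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # sample SD of differences
    return mean(diffs) / standard_error, n - 1


# Hypothetical scores for four yoked pairs (not data from the study).
t, df = paired_t([0.6, 0.7, 0.5, 0.8], [0.5, 0.5, 0.5, 0.6])
```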


Mind wandering detection. The detector’s likelihood of mind 
wandering was slightly higher for participants in the yoked- 
control condition (M = .431; SD = .170) compared to the 
intervention condition (M = .404; SD = .112), but the difference 
was not statistically significant (p = .348). This was unsurprising 
as participants in both groups received the same interventions, 
which were themselves expected to reduce mind wandering. Importantly, 
mind wandering likelihoods were negatively correlated with 
performance on the online questions (r = -.296, p = .033) as well 
as on posttest questions (r = -.319, p = .021). This provides 
evidence for the validity of the mind wandering detector when 
applied to a new set of learners and under different conditions 
(i.e., reading interspersed with online questions compared to 
uninterrupted reading). 


Comprehension assessment. There was some overlap between 
the online questions and the posttest questions. To obtain an 
unbiased estimate of learning, we only analyzed performance on 
previously unseen posttest questions. That is, questions that were 
used as part of the intervention were first removed before 
computing posttest scores. 
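
A minimal sketch of this scoring rule, assuming a hypothetical representation in which posttest answers are stored as a mapping from question identifier to correctness:

```python
def unseen_posttest_score(posttest_answers, intervention_question_ids):
    """Proportion correct on posttest questions not seen during reading.

    posttest_answers: dict mapping question id -> True/False (correct).
    intervention_question_ids: ids of questions shown as online questions.
    Returns None if every posttest question was already seen.
    """
    unseen = [correct for qid, correct in posttest_answers.items()
              if qid not in intervention_question_ids]
    if not unseen:
        return None
    return sum(unseen) / len(unseen)
```

Returning `None` instead of 0 for a participant with no unseen questions keeps such cases from being mistaken for a score of zero.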


There were no significant condition differences on overall 
posttest scores (p = .846). The intervention condition answered 
57.6% (SD = .157) of the questions correctly while the yoked 
control condition answered 58.1% (SD = .129) correctly. This 
finding was not surprising as both conditions received the exact 
same treatment except that the interventions were triggered based 
on detected mind wandering in the intervention condition but not 
the control condition. 


Next, we examined posttest performance as a function of mind 
wandering during reading. Each page was designated as a low or 
high mind wandering page based on a median split of mind 
wandering likelihoods (medians = .35 and .36 on a 0 to 1 scale for 
intervention and control conditions, respectively). We then 
analyzed performance on posttest questions corresponding to 
pages with low vs. high likelihoods of mind wandering (during 
reading). The results are shown in Table 3. 
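
The median split and the resulting grouping of pages for one yoked pair can be sketched as follows. The dictionary representation and the tie rule (a likelihood exactly equal to the median counts as low) are assumptions, since the text does not specify how ties were handled.

```python
def label_pages(likelihoods, cut):
    """Label each page 'high' or 'low' relative to the condition median.

    likelihoods: dict mapping page -> mind wandering likelihood.
    cut: median likelihood for the participant's condition.
    """
    return {page: ("high" if value > cut else "low")
            for page, value in likelihoods.items()}


def group_pages(intervention_levels, control_levels):
    """Group pages by (intervention level, control level) for one pair."""
    groups = {}
    for page in intervention_levels.keys() & control_levels.keys():
        key = (intervention_levels[page], control_levels[page])
        groups.setdefault(key, []).append(page)
    return {key: sorted(pages) for key, pages in groups.items()}


# Hypothetical likelihoods for three pages, using the reported medians.
int_levels = label_pages({1: 0.2, 2: 0.5, 3: 0.4}, cut=0.35)
ctrl_levels = label_pages({1: 0.5, 2: 0.1, 3: 0.6}, cut=0.36)
groups = group_pages(int_levels, ctrl_levels)
```

Posttest accuracy would then be averaged within each of the four groups, one group per row of Table 3.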


We found no significant posttest differences on pages where both 
the intervention and control participants had low (p = .759) or 
high (p = .922) mind wandering likelihoods (first and last rows in 
Table 3, respectively). There was also no significant posttest 
difference (p = .630) for pages where the intervention condition 
had high mind wandering likelihoods but the control condition 
had low mind wandering likelihoods (row 3). However, the 
intervention condition significantly (p = .003, d = .47 sigma) 
outperformed the control condition for pages where the 
intervention participants had low likelihoods of mind wandering 
but control participants had high mind wandering likelihoods 
(row 2). These last two findings suggest that the intervention had 
the intended effect of reducing comprehension deficits 
attributable to mind wandering because it led to equitable 
performance when mind wandering was high and improved 
performance when it was low. 


Table 3: Posttest performance (proportion of correct 
responses) as a function of mind wandering during reading. 
Standard deviations in parentheses. 


      Mind wandering      Posttest 
N     Int.     Cntrl.     Int.            Cntrl. 
43    Low      Low        .604 (.288)     .623 (.287) 
40    Low      High       .643 (.263)*    .489 (.298)* 
43    High     Low        .935 (.295)     .966 (.305) 
45    High     High       .522 (.312)     .915 (.291) 


Note. Int. = intervention. Cntrl. = control. Cells marked with an asterisk represent a 
statistically significant difference. N = number of pairs (out of 52) in each 
analysis. It differs slightly across analyses as not all participants were 
assigned to each mind wandering group. 


After-task interview. We interviewed a subset of the participants 
in order to gauge their subjective experience with the 
intervention. A few key themes emerged. Participants reported 
paying closer attention to the text after realizing they would be 
periodically answering multiple-choice questions. This was a desirable outcome. 
However, participants also reported that they adapted their 
reading strategies in one of two ways in response to the questions. 
Since the questions targeted factual information (sometimes 
verbatim) from the text, some participants paid more attention to 
details and precise wordings instead of the broader concepts being 
discussed in the text. More discouragingly, some participants 
reported adopting a preemptive skimming strategy in that they 
would only look for keywords that they expected to appear in a 
subsequent question. 


Participants were encouraged to re-read text when they answered 
incorrectly before receiving another question (or the same 
question in some cases). Many participants reported simply 
scanning the text (when re-reading) to locate keywords from the 
question before moving on. Since the scanning strategy was often 
successful in answering the subsequent question, participants 
reported that the questions were too easy and it took relatively 
little effort to locate the correct answer compared to re-reading. 
They suggested that it may have been better if the questions had 
targeted key concepts rather than facts. 


Finally, participants reported difficulties with re-engaging with 
the text after answering an online question because the text was 
cleared when an intervention question was displayed; an item that 
can be easily corrected in subsequent versions. 


5. Discussion 

We developed the first educational technology capable of real- 
time mind wandering detection and dynamic intervention during 
computerized reading. In the remainder of this section, we discuss 
the significance of our main findings, limitations, and avenues for 
future work. 


5.1 Significance of Main Findings 

We have three main findings. First, we demonstrated that a 
machine-learned mind wandering detector built in one context 
can be applied to a different (albeit related) interaction context. 
Specifically, the detector was trained on a data set involving 
participants silently reading and self-reporting mind wandering, 
but was applied to an interactive context involving interpolated 
assessments, which engendered different reading strategies. 
Further, self-reports of mind wandering were not collected in this 
interactive context, which might have influenced mind wandering 
rates in and of itself. Despite these differences, we were able to 
demonstrate the predictive validity of the detector by showing 
that it negatively correlated with both online and offline 
comprehension scores when evaluated on new participants. 


Second, we showed promising effects for our intervention 
approach despite a very conservative experimental design, which 
ensured that the intervention and control groups were equated 
along all respects, except that the intervention was triggered based 
on the mind wandering detector (key manipulation). Further, we 
used a probabilistic approach to trigger an intervention, because 
the detector is inherently imperfect. As a result, participants could 
have received an intervention when they were not mind 
wandering and/or could have failed to receive one when they were 
mind wandering. Therefore, it was essential to compare the two 
groups under conditions when the mind wandering levels 
differed. This more nuanced analysis revealed that although the 
intervention itself did not lead to a boost in overall comprehension 
(because it is remedial), it equated comprehension scores when 
mind wandering was high (i.e., scores for the intervention group 
were comparable when the control group was low on mind 
wandering). It also demonstrated the cost of not intervening 
during mind wandering (i.e., scores for the intervention group 
were greater when the control group was high on mind 
wandering). In other words, the intervention was successful in 
mitigating the negative effects of mind wandering. 


Third, despite the advantages articulated above, the intervention 
itself was reactive and engendered several unintended (and 
presumably suboptimal) behaviors. In particular, students altered 
their reading strategies in response to the interpolated questions, 
which were a critical part of the intervention. In a sense, they 
attempted to “game the intervention” by attempting to proactively 
predict the types of questions they might receive and then 
adopting a complementary reading strategy consisting of 
skimming and/or focusing on factual information. This reliance 
on surface- rather than deeper-levels of processing was 
incongruent with our goal of promoting deep comprehension. 


5.2 Limitations 

There are a number of methodological limitations with this work 
that go beyond limitations with the intervention (as discussed 
above). First, we focused on a single text that is perceived as 
being quite dull and consequently triggers rather high levels of 
mind wandering [26]. This raises the question of whether the 
detector will generalize to different texts. We expect some level 
of generalizability in terms of features used because the detector 
only used content- and position- (on the screen) free global gaze 
features. However, given that several supervised classifiers are 
very sensitive to differences in base rates, the detector might over- 
or under- predict mind wandering when applied to texts that 
engender different rates of mind wandering. Therefore, retraining 
the detector with a more diverse set of texts is warranted. 


Another limitation is the scalability of our learning technology. 
The eye tracker we used was a cost-prohibitive Tobii TX300 that 
will not scale beyond the laboratory. Fortunately, commercial- 
off-the-shelf (COTS) eye trackers, such as Eye Tribe and Tobii 
EyeX, can be used to overcome this limitation. It is an open question 
as to whether the mind wandering detector can operate with 
similar fidelity with these COTS eye trackers. Our use of global 
gaze features which do not require high-precision eye tracking 
holds considerable promise in this regard. Nevertheless, 
replication with scalable eye trackers and/or scalable alternatives 
to eye tracking (e.g., facial-feature tracking [46] or monitoring 
reading patterns [27]) is an important next step (see Section 5.3). 


Our use of surface-level questions for both the intervention and 
the subsequent comprehension assessment is also a limitation as 
is the lack of a delayed comprehension assessment. It might be 
the case that the intervention effects manifest as richer encodings 
in long-term memory, a possibility that cannot be addressed in the 
current experiment that only assessed immediate learning. 


Other limitations include a limited student sample (i.e., 
undergraduates from a private Midwestern college) and a 
laboratory setup. It is possible that the results would not 
generalize to a more diverse student population or in more 
ecological environments (but see below for evidence of 
generalizability of the detector in classroom environments). 
Replication with data from more diverse populations and 
environments would be a necessary next step to increase the 
ecological validity of this work. 


5.3 Future Work 


Our future work is progressing along two main fronts. One is to 
address limitations in the intervention and design of the 
experimental evaluation as discussed above. Accordingly, we are 
exploring alternative intervention strategies, such as: (a) tagging 
items for future re-study rather than interrupting participants 
during reading; (b) highlighting specific portions of the text as an 
overt cue to facilitate comprehension of critical information; (c) 
asking fewer intervention questions, but selecting inference 
questions that target deeper levels of comprehension and that span 
multiple pages of the text; and (d) asking learners to engage in 
reflection by providing written self-explanations of the textual 
content. We are currently evaluating one such redesigned 
intervention — open-ended questions targeting deeper levels of 
comprehension (item c). Our revised experimental design taps 
both surface- and inference-level comprehension and assesses 
comprehension immediately after reading (to measure learning) 
and after a one-week delay (to measure retention). 


We are also developing attention-aware versions of more 
interactive interfaces, such as learning with an intelligent tutoring 
system called GuruTutor [30]. This project also addresses some 
of the scalability concerns by replacing expensive research-grade 
eye tracking with cost-effective COTS eye tracking (e.g., the Eye 
Tribe or Tobii EyeX) and provides evidence for real-world 
generalizability by collecting data in classrooms rather than the 
lab. We recently tested our implementation on 135 students (total) 
in a noisy computer-enabled high-school classroom where eye- 
gaze of entire classes of students was collected during their 
normal class periods [20]. Using a similar approach to the present 
work, we used the data to build and validate a student- 
independent gaze-based mind wandering detector. The resultant 
mind wandering detection accuracy (F1 of 0.59) was substantially 
greater than chance (F1 of 0.24) and outperformed earlier work on 
the same domain [21]. The next step is to develop interventions 
that redirect attention and correct learning deficiencies 
attributable to mind wandering and to test the interventions in 
real-world environments. By doing so, we hope to advance our 
foundational vision of developing next-generation technologies 
that enhance the process and products of learning by “attending 
to attention.” 

Figure 3: GuruTutor interface overlaid with eye-gaze 
obtained via the Eye Tribe 


ABSTRACT 


Educational systems typically contain a large pool of items 
(questions, problems). Using data mining techniques we can 
group these items into knowledge components, detect du- 
plicated items and outliers, and identify missing items. To 
these ends, it is useful to analyze item similarities, which can 
be used as input to clustering or visualization techniques. 
We describe and evaluate different measures of item similar- 
ity that are based only on learners’ performance data, which 
makes them widely applicable. We provide evaluation using 
both simulated data and real data from several educational 
systems. The results show that Pearson correlation is a suit- 
able similarity measure and that response times are useful 
for improving stability of similarity measures when the scope 
of available data is small. 


1. INTRODUCTION 


Interactive educational systems offer learners items (prob- 
lems, questions) for solving. Realistic educational systems 
typically contain a large number of such items. This is par- 
ticularly true for adaptive systems, which try to present suit- 
able items for different kinds of learners. The management 
of a large pool of items is difficult. However, educational 
systems collect data about learners’ performance and the 
data can be used to get insight into item properties. In this 
work we focus on methods for computing item similarities 
based on learners’ performance data, which consists of bi- 
nary information about the answers (correct/incorrect). 


Automatically detected item similarities are the first and 
necessary step in further analysis such as clustering of the 
items, which is useful in several ways, with one particular 
application being learner modeling [9]. Learner models es- 
timate knowledge and skills of learners and are the basis 
of adaptive behavior of educational systems. A learner 
model requires a mapping of items into knowledge components [17]. 
Item clusters can serve as a basis for knowledge 
component definition or refinement. The specified knowledge 
components are relevant not only for modeling; 
they are also typically directly visible to learners in the user interface 
of a system, e.g., in the form of an open learner model 
visualizing the estimated knowledge state, or in a personal- 
ized overview of mistakes, which is grouped by knowledge 
components. 


Information about items is also very useful for management 
of the content of educational systems — preparation of new 
items, filtering of unsuitable items, preparation of explanations, 
and hint messages. Information about item similarities 
and clusters can also be relevant for teachers, as it can 
provide them with inspiration for “live” discussions in class. 
This type of application is in line with Baker’s argument [1] 
for focusing on the use of learning analytics for “leveraging 
human intelligence” instead of its use for automatic intelli- 
gent methods. 


Item similarities and clusters are studied not only in ed- 
ucational data mining but also in a closely related area of 
recommender systems. The setting of recommender systems 
is in many aspects very similar to educational systems — in 
both cases we have users and items, just instead of “perfor- 
mance” (the correctness of answers, the speed of answers) 
recommender systems consider “ratings” (how much a user 
likes an item). Item similarities and clustering techniques 
have thus been also considered in the recommender systems 
research (we mention specific techniques below). There is a 
slight, but important difference between the two areas. In 
recommender systems item similarities and clusterings are 
typically only auxiliary techniques hidden within a “recommendation 
black box”. In educational systems, it is useful to 
make these results explicitly available to system developers, 
curriculum production teams, or teachers. 


There are two basic approaches to dealing with item similar- 
ities and knowledge components: a “model based approach” 
and an “item similarity approach”. The basic idea of the 
model based approach is to construct a simplified model that 
explains the observed data. Based on a matrix of learners’ 
answers to items we construct a model that predicts these 
answers. Typically, the model assigns several latent skills to 
learners and uses a mapping of items to corresponding latent 
factors. This kind of model can often be naturally expressed 
using matrix multiplication, i.e., fitting a model leads to matrix 
factorization. Once we fit the model to data, items that 
have the same value of a latent factor can be denoted as 
“similar”. This approach leads naturally to multiple knowledge 
components per skill. The model is typically computed 
using some optimization technique that leads only to local 
optima (e.g., gradient descent). It is thus necessary to address 
the role of initialization and parameter setting of the 
search procedure. In recommender systems this approach is 
used for implementation of collaborative filtering; it is often 
called “singular value decomposition” (SVD) [18]. In edu- 
cational context many variants of this approach have been 
proposed under different names and terminology, e.g., Q- 
matrix [3], non-negative matrix factorization techniques [8], 
sparse factor analysis [19], or matrix refinement [10]. 
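
To make the model based approach concrete, the following is a minimal sketch of fitting a low-rank factorization to a sparse answer matrix with stochastic gradient descent. It is a generic illustration and not one of the cited methods; the function names, the learning rate, the number of factors, and the toy data are assumptions. The fixed seed reflects the point made above: the result depends on initialization because the search only finds a local optimum.

```python
import random


def factorize(answers, n_factors=2, epochs=500, lr=0.05, seed=0):
    """Fit answers[l][i] ~ sum_k skills[l][k] * loadings[i][k] by SGD.

    answers holds 1 (correct), 0 (incorrect) or None (missing).
    """
    rng = random.Random(seed)
    n_learners, n_items = len(answers), len(answers[0])
    skills = [[rng.uniform(0.0, 0.5) for _ in range(n_factors)]
              for _ in range(n_learners)]
    loadings = [[rng.uniform(0.0, 0.5) for _ in range(n_factors)]
                for _ in range(n_items)]
    # Only observed cells contribute to the fit.
    observed = [(l, i, row[i]) for l, row in enumerate(answers)
                for i in range(n_items) if row[i] is not None]
    for _ in range(epochs):
        for l, i, y in observed:
            pred = sum(s * q for s, q in zip(skills[l], loadings[i]))
            err = y - pred
            for k in range(n_factors):
                s, q = skills[l][k], loadings[i][k]
                skills[l][k] += lr * err * q
                loadings[i][k] += lr * err * s
    return skills, loadings


def squared_error(answers, skills, loadings):
    """Sum of squared prediction errors over the observed cells."""
    total = 0.0
    for l, row in enumerate(answers):
        for i, y in enumerate(row):
            if y is not None:
                pred = sum(s * q for s, q in zip(skills[l], loadings[i]))
                total += (y - pred) ** 2
    return total


# Toy data: two groups of items, with two missing answers.
data = [[1, 1, 0, None], [1, 1, 0, 0], [0, None, 1, 1], [0, 0, 1, 1]]
skills, loadings = factorize(data)
```

The rows of `loadings` play the role of the item-to-factor mapping; items with similar rows would be treated as similar.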


With the item similarity approach we do not construct an 
explicit model of learners’ behavior, but we compute directly 
a similarity measure for each pair of items. These similarities 
are then used to compute clusters of items, to project 
items into a plane, or for other analysis (e.g., for each item 
listing the 3 most similar items). This approach naturally 
leads to a mapping with a single knowledge component per 
item (i.e., a different kind of output from most model based 
methods). One advantage of this approach is easier inter- 
pretability. In recommender system research this approach 
is called neighborhood-based methods [11] or item-item col- 
laborative filtering [7]. Similarity has been used for clus- 
tering of items [23, 24] and also for clustering of users [29]. 
In educational settings, item similarity has been analyzed using 
correlation of learners’ answers [22] and problem solving 
times [21], and also using learners’ wrong answers [25]. 
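
As a small illustration of one use of a similarity matrix mentioned above (listing the most similar items for each item), consider the following sketch. The nested-list representation of the matrix and the example values are assumptions.

```python
def most_similar(similarity, item, k=3):
    """Return the indices of the k items most similar to `item`.

    similarity: square matrix (list of lists) where similarity[i][j]
    is the similarity of items i and j; higher means more similar.
    """
    others = [(similarity[item][j], j)
              for j in range(len(similarity)) if j != item]
    others.sort(key=lambda pair: pair[0], reverse=True)
    return [j for _, j in others[:k]]


# Hypothetical similarity matrix for four items.
sim = [[1.0, 0.9, 0.1, 0.5],
       [0.9, 1.0, 0.2, 0.4],
       [0.1, 0.2, 1.0, 0.8],
       [0.5, 0.4, 0.8, 1.0]]
neighbours = most_similar(sim, 0, k=2)
```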


So far we have discussed methods that are based only on 
data about learners’ answers. Often we have some additional 
information about items and their similarities, e.g., a man- 
ual labeling or data based on syntactic similarity of items 
(text of questions). For both model based and item similar- 
ity approaches previous research has studied techniques for 
combination of these different types of inputs [10, 21]. 


In this work we focus on the item similarity approach, be- 
cause in the educational setting this approach is less ex- 
plored than the model based approach. We discuss specific 
techniques, clarify details of their usage, and provide evalua- 
tion using both data from real learners and simulated data. 
Simulated data are useful for evaluation of the considered 
unsupervised machine learning tasks, because in the case of 
real-world data we do not know the “ground truth”. 


The specific contributions of this work are the following. We 
provide guidelines for the choice of item similarity measures 
— we discuss different options and provide results identifying 
suitable measures (Pearson, Yule, Cohen); we also demon- 
strate the usefulness of “two step similarity measures”. We 
explore the benefits of using response time information as a 
supplement to the usual information on the correctness of answers. 
We use and discuss several evaluation methods for the con- 
sidered tasks. We specifically consider the issue of “how 
much data do we need”. This is often practically more im- 
portant than the exact choice of a used technique, but the 
issue is rather neglected in previous work. 


2. MEASURES OF ITEM SIMILARITY 


Figure 1 provides a high-level illustration of the item similarity 
approach. This approach consists of two steps that 
are to a large degree independent. First, we compute an 
item similarity matrix, i.e., for each pair of items $i, j$ we 
compute the similarity $s_{ij}$ of these items. Second, we can 
construct clusters or visualizations of items using only the 
item similarity matrix. 


Figure 1: High-level illustration of the general approach 
to item analysis based on item similarities. 


Experience with clustering algorithms suggests that the ap- 
propriate choice of similarity measure is more important 
than choice of clustering algorithm [13]. The choice of simi- 
larity measure is domain specific and it is typically not ex- 
plored in general research on clustering. Therefore, we focus 
on the first step — the choice of similarity measure — and ex- 
plore it for the case of educational data. 


2.1 Basic Setting 


In this work we focus on computing item similarities using 
learners’ performance data. As Figure 1 shows, the simi- 
larity computation can also utilize information from domain 
experts or automatically determined information based on 
the inner structure of items (e.g., text of questions or some 
available meta-data). 


We discuss different possibilities for computation of item 
similarities. Note that in our discussion we consistently use 
“similarity measures” (higher values correspond to higher 
similarity); some related works provide formulas for dissimilarity 
measures (distance of items; lower values correspond 
to higher similarity). This is just a technical issue, as we can 
easily transform similarity into dissimilarity by subtraction. 


The input to item similarity computation is data about 
learner performance, i.e., an $L \times I$ matrix, where $L$ is the 
number of learners and $I$ is the number of items. The matrix 
values specify learners’ performance. The matrix is typically 
very sparse (many missing values). The output of the 
computation is an item similarity matrix, which specifies 
similarity for each pair of items. 


Note that in our discussion we mostly ignore the issue of 
learning (change of learners’ skill as they progress through 
items). When learning is relatively slow and items are pre- 
sented in a randomized order, learning is just a reasonably 
small source of noise and does not have a fundamental im- 
pact on the computation of item similarities. In cases where 
learning is fast or items are presented in a fixed order, it 
may be necessary to take learning explicitly into account. 


2.2 Correctness of Answers 
The basic type of information available in educational sys- 
tems is the correctness of learners’ answers. So we start with 
similarity measures that utilize only this type of information, 
i.e., dichotomous data (correct/incorrect) on learners’ 
answers to items. The advantage of these measures is that 
they are applicable in a wide variety of settings. 


With dichotomous data we can summarize learners’ performance 
on items $i$ and $j$ using an agreement matrix with 
just four values (Table 1). Although we have just four values 
to quantify the similarity of items $i$ and $j$, previous research 
has identified a large number of different measures for 
dichotomous data and analyzed their relations [5, 12, 20]. 
For example, Choi et al. [5] discuss 76 different measures, albeit 
many of them are only slight variations on one theme. 
Similarity measures over dichotomous data are often used in 
biology (co-occurrence of species) [14]. A more directly rele- 
vant application is the use of similarity measures for recom- 
mendations [30]. Recommender systems typically use either 
Pearson correlation or cosine similarity for computation of 
item similarities [11], but they consider richer than binary 
data. 


Table 1: An agreement matrix for two items and definitions
of similarity measures based on the agreement matrix
($n = a + b + c + d$ is the total number of observations).


                           item i
                     incorrect   correct
item j   incorrect       a          b
         correct         c          d

Yule      $S_y = (ad - bc)/(ad + bc)$
Pearson   $S_p = (ad - bc)/\sqrt{(a+b)(a+c)(b+d)(c+d)}$
Cohen     $S_c = (P_o - P_e)/(1 - P_e)$
          $P_o = (a + d)/n$
          $P_e = ((a+b)(a+c) + (b+d)(c+d))/n^2$
Sokal     $S_s = (a + d)/(a + b + c + d)$
Jaccard   $S_j = a/(a + b + c)$
Ochiai    $S_o = a/\sqrt{(a+b)(a+c)}$


Table 1 provides definitions of 6 measures that we have cho- 
sen for our comparison. In accordance with previous re- 
search (e.g., [5, 14]) we call measures by names of researchers 
who proposed them. The choice of measures was done in 
such a way as to cover measures used in the most closely re- 
lated work and measures which achieved good results (even 
if the previous work was in other domains). We also tried 
to cover different types of measures. 
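
As an illustration of the definitions in Table 1, the
following minimal sketch computes the agreement counts for
two items and evaluates the six measures. The function names,
the use of None for missing answers, and the convention of
returning 0 when a denominator is zero are our assumptions
for this example; they are not prescribed by the definitions.

```python
import math


def agreement_counts(answers_i, answers_j):
    """Count a, b, c, d over learners who answered both items.

    Inputs are lists aligned by learner with 1 (correct), 0 (incorrect)
    or None (missing; an assumption of this sketch). Following Table 1:
    a = both incorrect, d = both correct, b and c = mixed outcomes.
    """
    a = b = c = d = 0
    for x, y in zip(answers_i, answers_j):
        if x is None or y is None:
            continue  # skip learners who did not answer both items
        if x == 0 and y == 0:
            a += 1
        elif x == 1 and y == 0:
            b += 1
        elif x == 0 and y == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d


def _ratio(num, den):
    """Return num / den, or 0.0 for a zero denominator (our convention)."""
    return num / den if den else 0.0


def yule(a, b, c, d):
    return _ratio(a * d - b * c, a * d + b * c)


def pearson(a, b, c, d):
    return _ratio(a * d - b * c,
                  math.sqrt((a + b) * (a + c) * (b + d) * (c + d)))


def cohen(a, b, c, d):
    n = a + b + c + d
    p_o = _ratio(a + d, n)
    p_e = _ratio((a + b) * (a + c) + (b + d) * (c + d), n * n)
    return _ratio(p_o - p_e, 1 - p_e)


def sokal(a, b, c, d):
    return _ratio(a + d, a + b + c + d)


def jaccard(a, b, c, d):
    return _ratio(a, a + b + c)


def ochiai(a, b, c, d):
    return _ratio(a, math.sqrt((a + b) * (a + c)))


# Hypothetical answers of eight learners to two items.
item_i = [1, 1, 0, 0, 1, None, 0, 1]
item_j = [1, 0, 0, 1, 1, 1, None, 1]
counts = agreement_counts(item_i, item_j)
for measure in (yule, pearson, cohen, sokal, jaccard, ochiai):
    print(measure.__name__, round(measure(*counts), 3))
```

Computing such a value for every pair of items yields the
item similarity matrix.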


The Pearson measure is the standard Pearson correlation
coefficient evaluated over the dichotomous data. In the
context of dichotomous data it is also called the Phi
coefficient or the Matthews correlation coefficient. The Yule
measure is a similar measure, which achieved good results in
previous work [30]. The Cohen measure is typically used as a
measure of inter-rater agreement (it is more commonly called
“Cohen’s kappa”). In our setting it makes sense to consider
this measure when we view learners’ answers as “ratings” of
items. Relations between these three measures are discussed
in [32].


The Ochiai coefficient is typically used in biology [14]. It
is also equivalent to cosine similarity evaluated over
dichotomous data; cosine similarity is often used in
recommender systems for computing item similarity, albeit
typically over interval data [7]. The Sokal measure is also
called Sokal-Michener or “simple matching”. It is equivalent
to the accuracy measure used in information retrieval.
Together with the Jaccard measure, it is often used in
biology, but both have also been used for clustering of
educational data [12].


Note that some similarity measures are asymmetric with re- 
spect to 0 and 1 values. These measures are typically used 
in contexts where the interpretation of binary values is pres- 
ence/absence of a specific feature (or observation). In the 
educational context it is more natural to use measures which 
treat correct and incorrect answers symmetrically.
Nevertheless, for completeness we have also included some of
the commonly used asymmetric measures (Ochiai and Jaccard).
In these cases we focus on incorrect answers (value a as op- 
posed to d) as these are typically less frequent and thus bear 
more information. 
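
The asymmetry can be checked on a small example. The sketch
below assumes complete data (no missing answers) and uses
hypothetical answer vectors. It shows that the Ochiai measure
based on the value a equals cosine similarity computed over
indicators of incorrect answers, and that cosine similarity
over indicators of correct answers gives a different value.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def ochiai_incorrect(correct_i, correct_j):
    """Ochiai measure a / sqrt((a+b)(a+c)) with a = both answers incorrect."""
    a = sum(1 for x, y in zip(correct_i, correct_j) if x == 0 and y == 0)
    wrong_i = sum(1 for x in correct_i if x == 0)  # a + c
    wrong_j = sum(1 for y in correct_j if y == 0)  # a + b
    den = math.sqrt(wrong_i * wrong_j)
    return a / den if den else 0.0


# Hypothetical correctness vectors (1 = correct, 0 = incorrect).
correct_i = [1, 0, 0, 1, 1, 1]
correct_j = [0, 0, 1, 1, 1, 1]
incorrect_i = [1 - x for x in correct_i]
incorrect_j = [1 - y for y in correct_j]

on_incorrect = cosine(incorrect_i, incorrect_j)
on_correct = cosine(correct_i, correct_j)
print(ochiai_incorrect(correct_i, correct_j), on_incorrect, on_correct)
```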


2.3 Other Data Sources 


The correctness of answers is the basic source of informa- 
tion about item similarities, but not the only one. We 
can also use other data. The second major type of per- 
formance data is response time (time taken to answer an 
item). The basic approach to utilization of response time 
is to combine it with the correctness of an answer. Given 
the correctness value $c \in \{0, 1\}$, a response time
$t \in \mathbb{R}^+$, and the median of all response times
$\tau$, we combine them into a single score $r$. Examples of
such transformations are: linear transformation for correct
answers only ($r = c \cdot \max(1 - t/(2\tau), 0)$);
exponential discounting used in MatMat [28]
($r = c \cdot \min(1, 0.9^{t/\tau - 1})$); linear
transformation inspired by the high speed, high stakes
scoring rule used in Math Garden [16]
($r = (2c - 1) \cdot \max(1 - t/(2\tau), 0)$). The first
approach was used in our experiment due to its simplicity 
and high influence of response time information. 
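
The three transformations can be written down directly. The
following sketch is only an illustration; the function names
and the example response times are hypothetical.

```python
from statistics import median


def score_linear_correct(c, t, tau):
    """Linear transformation for correct answers only."""
    return c * max(1 - t / (2 * tau), 0)


def score_exponential(c, t, tau):
    """Exponential discounting of slow correct answers."""
    return c * min(1, 0.9 ** (t / tau - 1))


def score_high_speed_high_stakes(c, t, tau):
    """Linear transformation that also penalizes fast incorrect answers."""
    return (2 * c - 1) * max(1 - t / (2 * tau), 0)


# Hypothetical response times (in seconds) observed for one item.
times = [4.0, 8.0, 10.0, 12.0, 30.0]
tau = median(times)  # median of all response times

for c, t in [(1, 5.0), (1, 20.0), (0, 5.0)]:
    print(c, t,
          score_linear_correct(c, t, tau),
          score_exponential(c, t, tau),
          score_high_speed_high_stakes(c, t, tau))
```

Under the first transformation, a correct answer given at the
median response time scores 0.5, and any answer slower than
twice the median scores 0.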


The scores obtained in this way are real numbers. Given the 
scores it is natural to compute similarity of two items using 
Pearson correlation coefficient of scores (over learners who 
answered both items). It is also possible to utilize specific 
wrong answers for computation of item similarity [25]. 
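
A minimal sketch of the correlation over learners who
answered both items follows. Representing missing scores by
None and returning 0 for fewer than two common learners or
for zero variance are assumptions of this example.

```python
import math


def pearson_over_common(scores_i, scores_j):
    """Pearson correlation of scores over learners who answered both items."""
    pairs = [(x, y) for x, y in zip(scores_i, scores_j)
             if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0  # not enough common learners (our convention)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant scores
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores of five learners; None marks an unanswered item.
scores_i = [0.75, 0.0, None, 0.5, 1.0]
scores_j = [0.5, 0.0, 0.9, None, 1.0]
similarity = pearson_over_common(scores_i, scores_j)
print(round(similarity, 3))
```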


It is also possible to combine performance based measures 
with other types of data. For example we may estimate 
item similarity based on analysis of the content of items 
(syntactical similarity of texts), or collect expert opinion 
(manual categorization of items into several groups). The 
advantage of the similarity approach (compared to the
model-based approach) is that different similarity measures
can usually be combined in a straightforward way by using a
weighted average of the measures.
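
A weighted average of similarity matrices can be sketched as
follows. The matrices and weights are hypothetical, and the
sketch assumes that all measures are on comparable scales.

```python
def combine_similarities(matrices, weights):
    """Element-wise weighted average of equally sized square matrices."""
    total = sum(weights)
    n = len(matrices[0])
    return [[sum(w * m[i][j] for m, w in zip(matrices, weights)) / total
             for j in range(n)]
            for i in range(n)]


# Hypothetical inputs: a performance-based similarity matrix and an
# expert categorization (1.0 = same group) for two items.
performance = [[1.0, 0.2], [0.2, 1.0]]
expert = [[1.0, 1.0], [1.0, 1.0]]
combined = combine_similarities([performance, expert], weights=[3, 1])
print(combined)
```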


2.4 Second Level of Item Similarity 

The basic computation of item similarities computes the
similarity of items i and j using only data about these two
items. To improve a similarity measure, it is possible to
employ a “second level of item similarity” that is based on
the computed item similarity matrix and uses information on
all items. Examples of such a second step are Euclidean
distance and correlation: the similarity of items i and j is
given by the Euclidean distance or Pearson correlation of
rows i and j in the similarity matrix. Note that Euclidean
distance may be used implicitly when we use a standard
implementation of some clustering algorithms (e.g., k-means).
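
The second step can be sketched as an operation on rows of
the similarity matrix. In this example the rows are compared
over all columns, including the diagonal; this is an
assumption, since other choices (e.g., leaving out columns i
and j) are possible. The matrix itself is hypothetical.

```python
import math


def row_correlation(u, v):
    """Pearson correlation of two equally long rows (0.0 if undefined)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    var_u = sum((x - mean_u) ** 2 for x in u)
    var_v = sum((y - mean_v) ** 2 for y in v)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)


def second_level_euclid(sim):
    """Euclidean distance of rows i and j (a distance: lower = more similar)."""
    n = len(sim)
    return [[math.sqrt(sum((sim[i][k] - sim[j][k]) ** 2 for k in range(n)))
             for j in range(n)]
            for i in range(n)]


def second_level_pearson(sim):
    """Pearson correlation of rows i and j (higher = more similar)."""
    n = len(sim)
    return [[row_correlation(sim[i], sim[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical first-level similarity matrix: items 0, 1 and items 2, 3
# form two groups.
sim = [[1.0, 0.8, 0.1, 0.0],
       [0.8, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.7],
       [0.0, 0.1, 0.7, 1.0]]
dist = second_level_euclid(sim)
corr = second_level_pearson(sim)
print(round(dist[0][1], 3), round(dist[0][2], 3))
print(round(corr[0][1], 3), round(corr[0][2], 3))
```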


With the basic approach to item similarity, we consider 
items similar when performance of learners on these items is 
similar. With the second step of item similarity, we consider 
two items similar when they behave similarly with respect 
to other items. The main reason for using this second step 
is the reduction of noise in data by using more informa- 
tion. This may be useful particularly to deal with learning. 
Two very similar items may have rather low direct similarity,
because getting feedback on the first item can strongly
influence the performance on the second item. However, we
expect both items to have similar similarities to other items. 


A more technical reason for using the second step
(particularly the Euclidean distance) is to obtain a measure
that is a distance metric. The measures described above
mostly do not satisfy the triangle inequality and thus do not
satisfy the requirements for a distance metric; this property
may be important for some clustering algorithms.


3. EVALUATION 


In this work we focus on item similarity, but we keep the 
overall context depicted in Figure 1 in mind. The quality of 
a visualization is to a certain degree subjective and difficult 
to quantify, but the quality of clusters can be quantified and 
thus we can use it to compare similarity measures. From 
the large pool of existing clustering algorithms [15] we con- 
sider k-means, which is the most common implementation 
of centroid-based clustering, and hierarchical clustering. We
used the agglomerative (“bottom-up”) approach, in which items
are successively merged into clusters using Ward’s method as
the linkage criterion.


3.1 Data 


We use data from real educational systems as well as sim- 
ulated learner data. Real-world data provide information 
about the realistic performance of techniques, but the eval- 
uation is complicated by the fact that we do not know the 
“ground truth” (the “correct” similarity or clusters of items). 
Simulated data provide a setting that is in many aspects 
simplified but allows easier evaluation thanks to the access 
to the ground truth. 


Table 2: Data used for analysis.

                            learners   items     answers
Czech 1 (adjectives)           1 134     108      62 613
Czech 2                        4 567     210     336 382
MatMat: numbers                6 434      60      67 753
MatMat: addition               3 580     135      20 337
Math Garden: addition         83 297      30     881 994
Math Garden: multiplic.       97 842      30   1 233 024


For generating simulated data we use a simple approach with a
minimal number of assumptions and ad hoc parameters. Each
item belongs to one of k knowledge components. Each knowledge
component contains n items. Each item i has a difficulty
generated from the standard normal distribution,
$d_i \sim N(0, 1)$. Skills of learners with respect to
individual knowledge components are independent. The skill of
a learner l with respect to knowledge component j is
generated from the standard normal distribution,
$\theta_{lj} \sim N(0, 1)$. We assume no learning (constant
skills). Answers are generated as Bernoulli trials with the
probability of a correct answer given by the logistic
function of the difference of a relevant skill and an item
difficulty (a Rasch model):
$p = 1/(1 + e^{-(\theta_{lj} - d_i)})$. This approach is
rather standard; for example, Piech et al. [26] use a very
similar procedure, and other works also use closely related
procedures [4, 12]. In the experiment reported below the
basic setting is 100 learners and 5 knowledge components with
20 items each.
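
The generating procedure can be sketched in a few lines. The
function name, the fixed random seed, and the assumption that
every learner answers every item are ours; the text does not
specify how sparse the simulated data are.

```python
import math
import random


def simulate_answers(n_learners=100, n_components=5,
                     items_per_component=20, seed=0):
    """Generate dichotomous answers from a Rasch model with independent skills.

    Returns (answers, component), where answers[l][i] is 1 or 0 and
    component[i] is the knowledge component (ground truth) of item i.
    """
    rng = random.Random(seed)
    n_items = n_components * items_per_component
    component = [i // items_per_component for i in range(n_items)]
    difficulty = [rng.gauss(0, 1) for _ in range(n_items)]
    answers = []
    for _ in range(n_learners):
        # one independent skill per knowledge component, no learning
        skill = [rng.gauss(0, 1) for _ in range(n_components)]
        row = []
        for i in range(n_items):
            theta = skill[component[i]]
            p = 1 / (1 + math.exp(-(theta - difficulty[i])))
            row.append(1 if rng.random() < p else 0)
        answers.append(row)
    return answers, component


answers, component = simulate_answers()
print(len(answers), len(answers[0]), sum(map(sum, answers)))
```

The returned component labels serve as the ground truth
against which detected clusters can be compared.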


To evaluate techniques on realistic educational data, we use 
data from three educational systems. Table 2 describes the
size of the data sets used.


Umime Cesky (umimecesky.cz) is a system for practice of 
Czech spelling and grammar. We use data only from one ex- 
ercise from the system — simple “fill-in-the-blank” questions 
with two options. We use only data on the correctness of 
answers (response time is available, but since it depends on 
the text of a particular item its utilization is difficult). We 
focus particularly on one subset of items: questions about 
the choice between i/y in suffixes of Czech adjectives. For 
this subset we have manually determined 7 groups of items 
corresponding to Czech grammar rules. 


MatMat (matmat.cz) is a system for practice of basic arith- 
metic (e.g., counting, addition, multiplication). For each 
item we know the underlying construct (e.g., “13” or “7 + 
8”) and also the specific form of questions (e.g., what type of 
visualization has been used). We use data on both correctness
and response time. We selected the two largest subsets:
multiplication and numbers (practice of number sense,
counting).


Math Garden is another system for practice of basic arith- 
metic [16]. This system is more widely used than MatMat, 
but we do not have direct access to the system and detailed 
data. For the analysis we reuse publicly available data from 
previous research [6]. The available data contain both
correctness of answers and response times, but they contain
information only about 30 items without any identification 
of these items. 


3.2 Comparison of Similarity Measures 

To evaluate similarity measures we consider several types 
of analysis. With simulated data, we analyze the similarity 
measures with respect to the ground truth while for real- 
world data we evaluate correlations among similarity mea- 
sures. We also compare the quality of subsequent cluster- 
ings using adjusted Rand index (ARI) [27, 31], which mea- 
sures the agreement of two clusterings (with a correction for 
agreement due to chance). Typically, we use the adjusted 
Rand index to compare the clustering with a ground truth 
(available for simulated data) or with a manually provided
classification (available for the Czech 1 data set). It can be
also used to compare two detected clusterings (clusterings 
based on two different algorithms or clusterings based on 
two independent halves of data). 
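
For reference, the adjusted Rand index can be computed from
the contingency counts of two labelings. The sketch below
follows the standard pair-counting definition; returning 1.0
in the degenerate cases is a convention we assume here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of two clusterings of the same items."""
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items (our convention)
    # pairs of items placed together in both, in the first, in the second
    together_both = sum(comb(v, 2)
                        for v in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(v, 2) for v in Counter(labels_a).values())
    together_b = sum(comb(v, 2) for v in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # both clusterings are trivial (our convention)
    return (together_both - expected) / (maximum - expected)


# Hypothetical cluster labels for four items.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1]))
```

The index does not depend on how clusters are named, so a
detected clustering can be compared with the ground truth
without matching labels first.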


As a first step in the evaluation of similarity measures, we 
consider experiments with simulated data where we can uti- 
lize the ground truth. In clustering we expect high within- 
cluster similarity values and low between-cluster similarity 
values. Figure 2 shows the distribution of the similarity
values for selected measures and suggests which measures
separate within-cluster and between-cluster values better and
therefore which measures will be more useful in clustering.
The results show that for the Jaccard and Sokal measures the
values overlap to a large degree, whereas the Pearson and
Yule measures provide better results. Adding the second step
(Pearson correlation in this example) to the similarity
measure separates within-cluster and between-cluster values
better. This suggests that extending similarities in this way
is not only a necessary step for some subsequent algorithms
such as k-means but also a useful technique that improves
performance.


For data coming from real systems we do not know the ground
truth and thus we can only compare the similarity measures to
each other. To evaluate how similar two measures are, we take
the similarity values for all item pairs and compute their
correlation coefficient. Figure 3 shows results for two data
sets which are good representatives of the overall results.
The Pearson and Cohen measures are highly correlated (> 0.98)
across all data sets and have nearly the same values
(although not exactly the same). Larger differences (but only
up to 0.1) can typically be found when one of the values in
the agreement matrix is small, which happens only for poorly
correlated items with a resulting similarity value around 0.
The second pair of highly correlated measures is Ochiai and
Jaccard, which are both asymmetric with respect to the
agreement matrix. The correlation between these two pairs of
measures varies depending on the data set and in some cases
drops as low as 0.5. Because of the high correlation within
these pairs we further report results only for the Pearson
and Jaccard measures. The Yule measure is usually similar to
the Pearson measure (correlation usually around 0.9). The
main difference is that the Yule measure spreads values more
evenly across the interval [-1, 1]. Sokal is the most
outlying measure, with no or only small correlation (usually
< 0.6) with all other measures.


Figure 3: Correlations of similarity measures.


Figure 4 shows the effect of the second level of item
similarity on the Pearson measure (results for other measures
are analogous). The Euclidean distance as the second-level
similarity brings larger differences (lower correlation) than
Pearson correlation. The correlations for large data sets
such as Math Garden are usually high (> 0.9), and conversely
the lowest correlations are found in results for small data
sets. This suggests that the second level of similarity is
more significant, and thus potentially more useful, where
only a limited amount of data is available.


Figure 4: Correlations of Pearson measure and Pearson with
different second levels.


Finally, we evaluate the quality of the similarity measures
according to the performance of the subsequent clustering. Of
the two considered clustering methods we used hierarchical
clustering in this comparison because it naturally works with
a similarity measure and does not require a metric space. The
other method gives similar results and leads to the same
conclusions. Table 3 and Figure 5 show the results. Although
the results depend on the specific data set and the
clustering algorithm used, there is a quite clear general
conclusion. The Pearson and Yule measures provide better
results than Jaccard and Sokal, i.e., for the considered task
the latter two measures are not suitable. Pearson is usually
slightly better than Yule, but the choice between them does
not seem to be fundamental (which is not surprising given
that they are highly correlated). The results also show that
the “second step” is always useful. The results for simulated
data favor Euclidean distance over Pearson, but there are
almost no differences for real-world data.


Figure 5: The quality of clustering for different measures
used in the second step of item similarity. Top: Simulated
data with 5 correlated skills. Bottom: Czech grammar with 7
manually determined clusters.


3.3 Do We Have Enough Data?


In machine learning the amount of available data is often
more important than the choice of a specific algorithm [2].
Our results suggest that once we choose a suitable type of
similarity measure (e.g., Pearson, Cohen, or Yule), the
differences between these measures are not fundamental and
the more important issue becomes the size of the available
data.


Specifically, for a given data set we want to know whether 
the data are sufficiently large so that the computed item 
similarities are meaningful and stable. This issue can be
explored by analyzing confidence intervals for computed
similarity values. As a simple approach to the analysis of
similarity stability we propose the following: we split the
available data into two independent halves (in a
learner-stratified manner), compute the item similarities for
each half, and compute the correlation of the resulting item
similarities.
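
The split-half analysis can be sketched as follows. We read
the split as assigning all answers of a learner to exactly
one half; this reading, the use of the Pearson measure, and
the example data are assumptions of the sketch.

```python
import math
import random


def phi(col_i, col_j):
    """Pearson measure from the agreement counts of two answer columns."""
    a = b = c = d = 0
    for x, y in zip(col_i, col_j):
        if x is None or y is None:
            continue  # learner did not answer both items
        if x == 0 and y == 0:
            a += 1
        elif x == 1 and y == 0:
            b += 1
        elif x == 0 and y == 1:
            c += 1
        else:
            d += 1
    den = math.sqrt((a + b) * (a + c) * (b + d) * (c + d))
    return (a * d - b * c) / den if den else 0.0


def item_similarities(rows):
    """Similarity values for all item pairs (upper triangle, flattened)."""
    n_items = len(rows[0])
    cols = [[row[i] for row in rows] for i in range(n_items)]
    return [phi(cols[i], cols[j])
            for i in range(n_items) for j in range(i + 1, n_items)]


def correlation(xs, ys):
    """Pearson correlation of two equally long lists (0.0 if undefined)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def split_half_stability(rows, seed=0):
    """Correlate item similarities computed on two random halves of learners."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    half = len(order) // 2
    first = [rows[k] for k in order[:half]]
    second = [rows[k] for k in order[half:]]
    return correlation(item_similarities(first), item_similarities(second))


# Hypothetical data: 100 learners, items 0-1 and items 2-3 always agree.
rows = [[x, x, y, y] for x in (0, 1) for y in (0, 1)] * 25
print(split_half_stability(rows))
```

Repeating the computation on subsamples of learners shows how
the stability changes with the amount of data.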
Figure 6: Stability of similarity measure (Yule) for
real-world data sets. Each data set was sampled and split
into halves, and the Pearson correlation of the similarity
values was computed. Numbers on the right side indicate
thousands of answers in data sets.


We can also perform this computation for artificially reduced 
data sets — this shows how the stability of results increases 
with the size of data. Figure 6 shows this kind of analysis 
for our data (real-world data sets). We clearly see large
differences among individual data sets. The Math Garden data
set contains a large number of answers and only a few items;
the results show excellent stability, so in this case we
clearly have enough data to analyze item similarities. For
the Czech grammar data set we have a large number of answers,
but these are divided among a relatively large number of
items. The results show reasonably good stability: the data
are usable for analysis, but more data could clearly bring
improvement. For the MatMat data the stability is poor; to
draw solid conclusions about item similarities we need more
data.


3.4 Response Time Utilization 

The incorporation of response time information into a
similarity measure can change the meaning of similarity.
Figure 7 gives such an example and shows a projection of
items from MatMat practicing number sense. Similar items
according to
measures using only correctness of answers tend to be items 
with the same graphical representation in the system. On 
the other hand, similar items according to measures using 
also response time are usually items practicing close num- 
bers. 


We used this method also on data sets from Math Garden, 
which are much larger. In this case the use of response times
has only a small impact on the computed item similarities
(correlations between 0.9 and 0.95). However, the use of
response times influences how quickly the computation
converges, i.e., how much data we need. To explore this we
take as the ground truth the average of the similarity
matrices computed with and without response times for the
whole data set. We then used smaller samples of the data set
to compute item similarities and checked their agreement with
this ground truth. Figure 8 shows the difference in the speed
of convergence of the measure with and without response time
utilization. The results show that the measure which uses
additional information from response time converges to the
ground truth much faster. This result suggests that the use
of response time can improve clustering or visualizations
when only a small number of answers is available.

Table 3: Comparison of similarity measures for one real-world
data set (with sampled students) and simulated data sets with
c knowledge components and l learners. The values provide the
adjusted Rand index (with 0.95 confidence interval) for a
hierarchical clustering computed based on the specific
similarity measure. The top result for every data set is
highlighted.


                    Czech 1      l = 50       l = 100      l = 200      l = 100      l = 100
                    (c = 7)      c = 5        c = 5        c = 5        c = 2        c = 10
Pearson             0.32 ± 0.02  0.26 ± 0.04  0.48 ± 0.05  0.84 ± 0.05  0.77 ± 0.12  0.34 ± 0.04
Jaccard             0.31 ± 0.03  0.06 ± 0.03  0.15 ± 0.04  0.29 ± 0.08  0.32 ± 0.18  0.09 ± 0.02
Yule                0.31 ± 0.03  0.19 ± 0.04  0.43 ± 0.05  0.77 ± 0.07  0.60 ± 0.15  0.31 ± 0.03
Sokal               0.15 ± 0.06  0.11 ± 0.02  0.18 ± 0.03  0.25 ± 0.05  0.12 ± 0.11  0.14 ± 0.02
Pearson → Euclid    0.43 ± 0.01  0.45 ± 0.05  0.80 ± 0.06  0.98 ± 0.01  0.95 ± 0.03  0.67 ± 0.04
Yule → Euclid       0.32 ± 0.02  0.36 ± 0.05  0.65 ± 0.07  0.94 ± 0.04  0.89 ± 0.11  0.43 ± 0.03
Pearson → Pearson   0.41 ± 0.03  0.39 ± 0.05  0.73 ± 0.06  0.96 ± 0.02  0.92 ± 0.03  0.55 ± 0.04
Yule → Pearson      0.32 ± 0.03  0.38 ± 0.05  0.72 ± 0.06  0.97 ± 0.02  0.94 ± 0.04  0.55 ± 0.05

Figure 7: Projection of items practicing number sense from
the MatMat system. Left: Measure based only on correctness.
Right: Measure using response time. Opacity corresponds to
the number value of the item and color corresponds to the
graphical representation of the task.


4. DISCUSSION 


Our focus is the automatic computation of item similarities 
based on learners’ performance data. These similarities can
then be used in further analysis of item relations, such as
item clustering or visualization. This outlines a direction
for future work, in which methods using the item similarities
should be studied in more detail. Compared to alternative 
approaches that have been proposed for the task (e.g., ma- 
trix factorizations, neural networks), the item similarity ap- 
proach is rather straightforward, easy to realize, and it can 
be easily combined with other sources of information about 
items (text of items, expert opinion). For these reasons the 
item similarity approach should be used at least as a baseline 
in proposals for more complex methods like deep knowledge 
tracing [26]. 


The most difficult step in this approach is the choice of a
similarity measure. Once we make a specific choice, the
realization of the approach is easy. Our results provide some
guidelines for this choice. The Pearson, Yule, and Cohen
measures lead to significantly better results than the
Ochiai, Sokal, and Jaccard measures. It is also beneficial to
use the second step of item similarity (e.g., the Euclidean
distance over vectors of item similarities). The exact choice
of details does not seem to make a fundamental difference
(e.g., Pearson versus Yule in the first step, the Euclidean
distance versus Pearson correlation in the second step). The
Pearson correlation coefficient is a good “default choice”,
since it provides quite robust results and is applicable in
several settings and steps. It also has the pragmatic
advantage of having fast, readily available implementations
in nearly all computational environments, whereas measures
like Yule may require additional implementation effort.


Figure 8: The speed of convergence to ground truth for
measures with and without response time on the Math Garden
addition data set.


The amount of data available is the critical factor for the suc- 
cess of automatic analysis of item relations. A key question 
for practical applications is thus: “Do we have enough data 
to use automated techniques?” In this work we used several 
specific methods for analysis of this question, but the issue 
requires more attention — not just for the item similarity 
approach, but also for other methods proposed in previous 
work. For example, previous work on deep knowledge tracing
[26], which studies closely related issues, states only that
deep neural networks require large data without providing any
specific quantification of what ‘large’ means. The necessary
quantity of data is, of course, connected to the quality of
data: some data sources are noisier than others, e.g.,
answers from voluntary practice contain more noise than
answers from high-stakes testing. An important direction for
future work is thus to compare model-based and item
similarity approaches while taking into account the ‘amount
and quality of data available’ issue.


ABSTRACT


Massive open online courses (MOOCs) have demonstrated grow- 
ing popularity and rapid development in recent years. Discussion 
forums have become crucial components for students and instruc- 
tors to widely exchange ideas and propagate knowledge. It is im- 
portant to recommend helpful information from forums to students 
for the benefit of the learning process. However, students or in- 
structors update discussion forums very often, and the student pref- 
erences over forum contents shift rapidly as a MOOC progresses. 
So, MOOC forum recommendations need to be adaptive to these 
evolving forum contents and drifting student interests. These fre- 
quent changes pose a challenge to most standard recommendation 
methods as they have difficulty adapting to new and drifting ob- 
servations. We formalize the discussion forum recommendation 
problem as a sequence prediction problem. Then we compare dif- 
ferent methods, including a new method called context tree (CT), 
which can be effectively applied to online sequential recommen- 
dation tasks. The results show that the CT recommender performs 
better than other methods for the MOOC forum recommendation task.
We analyze the reasons for this and demonstrate that it is because 
of better adaptation to changes in the domain. This highlights the 
importance of considering the adaptation aspect when building
recommender systems with drifting preferences, as well as when
using machine learning in general.


Keywords 


MOOCs forum recommendation, context tree, model adaptation 


1. INTRODUCTION 


With the increased availability of data, machine learning has
become the method of choice for knowledge acquisition in intelligent
systems and various applications. However, data and the knowledge
derived from it have a timeliness, such that in a dynamic
environment not all the knowledge acquired in the past remains valid.
Therefore, machine learning models should acquire new knowledge
incrementally and adapt to dynamic environments. Today, many
intelligent systems deal with dynamic environments: information
on websites, social networks, and applications in commercial
markets. In such evolving environments, knowledge needs
to adapt to the changes very frequently. Many statistical machine
learning techniques interpolate between input data and thus their
models can adapt only slowly to new situations. In this paper,
we consider dynamic environments for the recommendation task.
Drifting user interests and preferences [3, 11] are important in
building personal assistance systems, such as recommendation systems
for social networks or for news websites, where recommendations
need to be adaptive to drifting trends rather than recommending
obsolete or well-known information. We focus on the application
of recommending forum contents for massive open online courses
(MOOCs), where we found that the adaptation issue is a crucial
aspect for providing useful and trendy information to students.


The rapid emergence of MOOC platforms and of the many MOOCs
provided on them has opened up a new era of education by pushing
the boundaries of education to the general public. In this special
online classroom setting, sharing with your classmates or asking for
help from instructors is not as easy as in traditional brick-and-mortar
classrooms. Discussion forums have therefore become one of the
most important components for students to widely exchange ideas
and to obtain instructors’ supplementary information. MOOC
forums play the role of social learning media for knowledge
propagation, with an increasing number of students and interactions
as a course progresses. All members of the forum can discuss course
content with each other, and the intensive interaction between them
supports knowledge propagation between members of the learning
community.


The online discussion forums are usually well structured via the
different threads which are created by students or instructors; they
can contain several posts and comments within the topic. An
example of the discussion forum from the well-known “Machine
Learning” course by Andrew Ng on Coursera¹ is shown in Figure 1.
The left figure shows various threads and the right figure illustrates
some replies within the last thread ("Having a problem with the
Collaborative Filtering Cost"). In general, the replies within a thread
are related to the topic of the thread and they can also refer to some
other threads for supplementary information, like the link in the
second reply. Our goal is to point the students towards useful
forum threads through effectively mining forum visit patterns.


¹ https://www.coursera.org/


Figure 1: A sample discussion forum. Left: sample threads. Right: replies within the last thread ("Having a problem with the Collaborative Filtering Cost").


Two aspects set forum recommendation systems for MOOCs apart
from other recommendation scenarios. First, student interests and
preferences drift fast during the span of a course, which is
influenced by the dynamics in forums and the content of the course;
second, the pool of items to be recommended and the items
themselves are evolving over time because forum threads can be edited
very frequently by either students or instructors. So the recommen- 
dations provided to students need to be adaptive to these drifting 
preferences and evolving items. Traditional recommendation tech- 
niques, such as collaborative filtering and methods based on ma- 
trix factorization, only adapt slowly, as they build an increasingly 
complex model of users and items. Therefore, when a new item is 
superseded by a newer version or a new preference pattern appears, 
it takes time for recommendations to adapt. To better address the 
dynamic nature of recommendation in MOOCs, we model the rec- 
ommendation problem as a dynamic and sequential machine learn- 
ing problem for the task of predicting the next item in a sequence of 
items consumed by a user. During the sequential process, the chal- 
lenge is combining old knowledge with new knowledge such that 
both old and new patterns can be identified fast and accurately. We 
use algorithms for sequential recommendation based on variable- 
order Markov models. More specifically, we use a structure called 
context tree (CT) [21] which was originally proposed for lossless 
data compression. We apply the CT method for recommending 
discussion forum contents for MOOCs, where adapting to drift- 
ing preferences and dynamic items is crucial. In experiments, it is 
compared with various sequential and non-sequential methods. We 
show that both old knowledge and new patterns can be captured ef- 
fectively through context activation using CT, and that this is why 
it is particularly strong at adapting to drifting user preferences and 
performs extremely well for MOOC forum recommendation tasks. 


The main contributions of this paper are fourfold:


• We apply the context tree structure to sequential
recommendation tasks where dynamic item sets and drifting user
preferences are of great concern.


• We analyze how the dynamic changes in user preferences are
followed in different recommendation techniques.


• We conduct extensive experiments for both sequential and
non-sequential recommendation settings. Through the
experimental analysis, we validate our hypothesis that the CT
recommender adapts well to drifting preferences.


• We propose and test a partial context matching (PCT)
technique, built on top of the standard CT method, to generalize
to new sequence patterns; it further boosts the recommendation
performance.


2. RELATED WORK 


Typical recommender systems adopt a static view of the recommen- 
dation process and treat it as a prediction problem over all historical 
preference data. From the perspective of generating adaptive
recommendations, we contend that it is more appropriate to view the
recommendation problem as a sequential decision problem. Next,
we mainly review some techniques developed for recommender 
systems with temporal or sequential considerations. 


The most well-known class of recommender system is based on 
collaborative filtering (CF) [19]. Several attempts have been made 
to incorporate temporal components into the collaborative filtering 
setting to model users’ drifting preferences over time. A common 
way to deal with the temporal nature is to give higher weights to 
events that happened recently. [6, 7, 15] introduced algorithms 
for item-based CF that compute the time weightings for different 
items by adding a tailored decay factor according to the user’s own 
purchase behavior. For low-dimensional linear factor models, [11]
proposed a model called “TimeSVD” to predict movie ratings for
Netflix by modeling temporal dynamics, including periodic effects,
via matrix factorization. As retraining latent factor models is costly,
one alternative is to learn the parameters and update the decision
function online for each new observation [1, 16]. [10] applied the
online CF method, coupled with an item popularity-aware
weighting scheme on missing data, to recommending social web content
with implicit feedback.


Markov models have also been applied to recommender systems to learn
the transition function over items. [24] treated recommendation as
a univariate time series problem and described a sequential model
with a fixed history. Predictions are made by learning a forest of
decision trees, one for each item. When the number of items is large,
this approach does not scale. [17] viewed the problem of generating
recommendations as a sequential decision problem and considered
a finite mixture of Markov models with fixed weights. [4]
applied Markov models to recommendation tasks using skipping
and weighting techniques for modeling long-distance relationships
within a sequence. A major drawback of these Markov models is
that it is not clear how to choose the order of the Markov chain.


Online algorithms for recommendation have also been proposed in
several studies. In [18], a Q-learning-based travel recommender is
proposed, where trips are ranked using a linear function of several 
attributes and the weights are updated according to user feedback. 
A multi-armed bandit model called LinUCB is proposed by [13] 
for news recommendation to learn the weights of the linear reward 
function, in which news articles are represented as feature vectors; 
click-through rates of articles are treated as the payoffs. [20] pro- 
posed a similar recommender for music recommendation with rat- 
ing feedback, called Bayes-UCB, that optimizes the nonlinear re- 
ward function using Bayesian inference. [14] used a Markov De- 
cision Process (MDP) to model the sequential user preferences for 
recommending music playlists. However, the exploration phase of 
these methods makes them adapt slowly. As user preferences drift
fast in many recommendation settings, it is not effective to explore
all options before generating useful ones.


Within the context of recommendation for MOOCs, [23] proposed 
an adaptive feature-based matrix factorization framework for course 
forum recommendation, and the adaptation is achieved by utilizing 
only recent features. [22] designed a context-aware matrix factor- 
ization model to predict student preferences for forum contents, and 
the context considered includes only supplementary statistical fea- 
tures about students and forum contents. In this paper, we focus on 
a class of recommender systems based on a structure, called con- 
text tree [21], which was originally used to estimate variable-order 
Markov models (VMMs) for lossless data compression. Then, [2, 
12, 5] applied this structure to various discrete sequence predic- 
tion tasks. Recently it was applied to news recommendation by 
[8, 9]. The most important property of online algorithms is the no- 
regret property, meaning that the model learned online is eventually 
as good as the best model that could be learned offline. Accord- 
ing to [21], the no-regret property is achieved by context trees for 
the data compression problem. Regret analysis for CT was con- 
ducted through simulation by [5] for stochastically generated hid- 
den Markov models with small state space. They show that CT 
achieves the no-regret property when the environment is stationary. 
As we focus on dynamic recommendation environments with
time-varying preferences and limited observations, the no-regret
property can hardly be achieved, and model adaptation matters
more for performance.


3. CONTEXT TREE RECOMMENDER 


Due to the sequential item consumption process, user preferences 
can be summarized by the last several items visited. When model- 
ing the process as a fixed-order Markov process [17], it is difficult 
to select the order. A variable-order Markov model (VMM), like a 
context tree, alleviates this problem by using a context-dependent 
order. The context tree is a space efficient structure to keep track 
of the history in a variable-order Markov chain so that the data 
structure is built incrementally for sequences that actually occur. A
local prediction model, called an expert, is assigned to each tree node;
it only gives predictions for users who have consumed the sequence
of items corresponding to the node. In this section, we first
introduce how to use the CT structure and the local prediction model for
sequential recommendation. Then, we discuss adaptation proper- 
ties and the model complexity of the CT recommender. 
3.1 The Context Tree Data Structure


In CT, a sequence $s = (n_1, \ldots, n_l)$ is an ordered list of items
$n_i \in N$ consumed by a user. The sequence of items viewed until
time $t$ is $s_t$, and the set of all possible sequences is $\mathcal{S}$.


A context $S = \{s \in \mathcal{S} : \xi \prec s\}$ is the set of all possible
sequences in $\mathcal{S}$ ending with the suffix $\xi$. $\xi$ is a suffix
($\prec$) of $s$ if the last elements of $s$ are equal to $\xi$. For example,
one suffix $\xi$ of the sequence $s = (n_2, n_3, n_1)$ is given by
$\xi = (n_3, n_1)$.


A context tree $\mathcal{T} = (\mathcal{V}, \mathcal{E})$ with nodes $\mathcal{V}$ and
edges $\mathcal{E}$ is a partition tree over all contexts of $\mathcal{S}$. Each
node $i \in \mathcal{V}$ in the context tree corresponds to a context $S_i$.
If node $i$ is an ancestor of node $j$, then $S_j \subset S_i$. Initially,
the context tree $\mathcal{T}$ only contains a root node with the most
general context. Every time a new item is consumed, the active leaf
node is split into a number of subsets, which then become nodes in
the tree. This construction results in a variable-order Markov model.
Figure 2 illustrates a simple CT with some sequences over an item
set $(n_1, n_2, n_3)$. Each node in the CT corresponds to a context. For
instance, the node $(n_1)$ represents the context with all sequences
that end with item $n_1$.


Figure 2: An example context tree. For the sequence $s = (n_2, n_3, n_1)$,
the nodes drawn with red dashed outlines are activated.
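The context structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `Node`, `insert`, and `active_path` are hypothetical, and for brevity `insert` adds every suffix of a history up to an assumed `max_depth` at once, rather than splitting one active leaf per consumed item.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """One context: the suffix (oldest..newest) shared by all its sequences."""
    suffix: Tuple[str, ...] = ()
    children: Dict[str, "Node"] = field(default_factory=dict)


def insert(root: Node, history: List[str], max_depth: int = 3) -> None:
    """Add the contexts for the suffixes of `history`, up to max_depth items."""
    node = root
    # Walk backwards from the most recent item: each step lengthens the suffix.
    for item in reversed(history[-max_depth:]):
        if item not in node.children:
            node.children[item] = Node(suffix=(item,) + node.suffix)
        node = node.children[item]


def active_path(root: Node, history: List[str]) -> List[Node]:
    """Return the nodes whose suffix matches the end of `history`, root first."""
    path, node = [root], root
    for item in reversed(history):
        node = node.children.get(item)
        if node is None:
            break
        path.append(node)
    return path


root = Node()
insert(root, ["n2", "n3", "n1"])
insert(root, ["n1", "n2"])
suffixes = [n.suffix for n in active_path(root, ["n2", "n3", "n1"])]
print(suffixes)
```

For the history $(n_2, n_3, n_1)$, the active path runs from the root (empty suffix) through $(n_1)$ and $(n_3, n_1)$ to $(n_2, n_3, n_1)$, which corresponds to the activated nodes in Figure 2.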


3.2 Context Tree for Recommendation 

For each context $S_i$, an expert $\mu_i$ is associated in order to compute
the estimated probability $P(n_{t+1} \mid s_t)$ of the next item $n_{t+1}$ under
this context. A user’s browsing history $s_t$ is matched to the CT and
identifies a path of matching nodes (see Figure 2). All the experts
associated with these nodes are called active. The set of active
experts $A(s_t) = \{\mu_i : \xi_i \prec s_t\}$ is the set of experts $\mu_i$ associated
with contexts $S_i = \{s : \xi_i \prec s\}$ such that $\xi_i$ is a suffix of $s_t$.
$A(s_t)$ is responsible for the prediction for $s_t$.


3.2.1 Expert Model 


The standard way of estimating the probability $P(n_{t+1} \mid s_t)$, as
proposed by [5], is to use a Dirichlet-multinomial prior for each expert
$\mu_i$. The probability of viewing an item $x$ depends on the number of
times $\alpha_{xt}$ the item $x$ has been consumed when the expert is active
until time $t$. The corresponding marginal probability is:

$$P_i(n_{t+1} = x \mid s_t) = \frac{\alpha_{xt} + \alpha_0}{\sum_{j \in N} \alpha_{jt} + \alpha_0} \qquad (1)$$

where $\alpha_0$ is the initial count of the Dirichlet prior.
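A count-based expert of this kind can be sketched as follows. The class name `Expert` and the default `alpha0 = 1.0` are assumptions made for illustration. The score is computed as $(\alpha_{xt} + \alpha_0) / (\sum_j \alpha_{jt} + \alpha_0)$, following Eq. 1 as printed; in this form the scores are not normalized over items, which does not matter when they are only used for ranking.

```python
from collections import Counter


class Expert:
    """Count-based expert following Eq. 1 as printed (alpha0 is a prior count)."""

    def __init__(self, alpha0: float = 1.0):
        self.alpha0 = alpha0
        self.counts = Counter()  # alpha_{x,t}: views of item x while active

    def update(self, item: str) -> None:
        """Record that `item` was consumed while this expert was active."""
        self.counts[item] += 1

    def prob(self, item: str) -> float:
        """Score of `item`: (alpha_x + alpha0) / (sum_j alpha_j + alpha0)."""
        total = sum(self.counts.values())
        return (self.counts[item] + self.alpha0) / (total + self.alpha0)


expert = Expert(alpha0=1.0)
for viewed in ["a", "a", "b"]:
    expert.update(viewed)
print(expert.prob("a"), expert.prob("b"), expert.prob("c"))
```

An item that the expert has never seen (here `"c"`) still receives a non-zero score through the prior count, so new items can be ranked immediately.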
3.2.2 Combining Experts for Prediction

When making a recommendation for a sequence $s_t$, we first identify
the set of contexts and active experts that match the sequence. The
predictions given by all the active experts are combined by mixing
the recommendations given by them:

$$P(n_{t+1} = x \mid s_t) = \sum_{i \in A(s_t)} u_i(s_t) \, P_i(n_{t+1} = x \mid s_t) \qquad (2)$$


The mixture coefficient $u_i(s_t)$ of expert $\mu_i$ is computed in Eq. 3
using the weight $w_i \in [0, 1]$. Weight $w_i$ is the probability that
the chosen recommendation stops at node $i$ given that it can be
generated by the first $i$ experts, and it can be updated using Eq. 5.

$$u_i(s_t) = \begin{cases} w_i \prod_{j : S_j \subset S_i} (1 - w_j), & \text{if } s_t \in S_i \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

The combined prediction of the first $i$ experts is defined as $q_i$, and
it can be computed using the recursion in Eq. 4. The recursive
construction estimates, for each context at a certain depth $i$,
whether it makes a better prediction than the combined prediction
$q_{i-1}$ from depth $i - 1$.

$$q_i = w_i P_i(n_{t+1} = x \mid s_t) + (1 - w_i) \, q_{i-1} \qquad (4)$$
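To make the relation between the explicit mixture (Eqs. 2 and 3) and the recursion (Eq. 4) concrete, the sketch below computes both for one active path and checks that they agree. The function names are hypothetical, index 0 stands for the root, and fixing the root weight to 1 is an assumption made here so that the recursion has a well-defined starting point.

```python
from typing import List


def mixture_coefficients(weights: List[float]) -> List[float]:
    """Eq. 3: u_i = w_i * prod over deeper nodes j of (1 - w_j)."""
    coeffs = []
    for i, w in enumerate(weights):
        rest = 1.0
        for w_deeper in weights[i + 1:]:
            rest *= (1.0 - w_deeper)
        coeffs.append(w * rest)
    return coeffs


def combined_predictions(weights: List[float], probs: List[float]) -> List[float]:
    """Eq. 4: q_i = w_i * P_i + (1 - w_i) * q_{i-1}, returned for every depth."""
    q, out = 0.0, []
    for w, p in zip(weights, probs):
        q = w * p + (1.0 - w) * q
        out.append(q)
    return out


weights = [1.0, 0.5, 0.2]  # root first; root weight fixed to 1 (assumption)
probs = [0.1, 0.4, 0.9]    # P_i(x) given by the three active experts
u = mixture_coefficients(weights)
q = combined_predictions(weights, probs)
mixture = sum(u_i * p_i for u_i, p_i in zip(u, probs))  # Eq. 2
print(u, q[-1], mixture)
```

With a root weight of 1 the coefficients sum to one, and the value of the recursion at the deepest active node equals the explicit mixture, so a prediction costs one pass along the active path.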


The weights are updated by taking into account the success of a
recommendation. When a user consumes a new item $x$, we update
the weights of the active experts corresponding to the suffix ending
before $x$ according to the probability $q_i(x)$ of predicting $x$
sequentially via Bayes’ theorem. The weights are updated in closed form
in Eq. 5, and a detailed derivation can be found in [5].

$$w_i' = \frac{w_i P_i(n_{t+1} = x \mid s_t)}{q_i(x)} \qquad (5)$$
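The closed-form update can be illustrated with a short sketch. It assumes the update has the form $w_i' = w_i P_i(x) / q_i(x)$, with $q_i(x)$ obtained from the recursion $q_i = w_i P_i(x) + (1 - w_i) q_{i-1}$; the function name `update_weights` and the example numbers are hypothetical.

```python
from typing import List


def update_weights(weights: List[float], probs: List[float]) -> List[float]:
    """Return w_i' = w_i * P_i(x) / q_i(x) for each expert on the active path.

    `weights` and `probs` are ordered root first; `probs[i]` is the probability
    that expert i assigned to the item x that was actually consumed.
    """
    new_weights, q = [], 0.0
    for w, p in zip(weights, probs):
        q = w * p + (1.0 - w) * q  # q_i(x), combined prediction up to depth i
        new_weights.append(w * p / q if q > 0 else w)
    return new_weights


# Deeper expert predicted the consumed item better than the root: weight grows.
rewarded = update_weights([1.0, 0.5], [0.2, 0.6])
# Deeper expert predicted it worse than the root: weight shrinks.
penalized = update_weights([1.0, 0.5], [0.6, 0.2])
print(rewarded, penalized)
```

An expert's weight increases when it predicted the consumed item better than the combined prediction of the shallower experts, and decreases otherwise.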


3.2.3 CT Recommender Algorithm 


The whole recommendation process first goes through all users’
activity sequences over time incrementally to build the CT; the local
experts and weights are updated using Equations 1 and 5, respectively.
As users browse more contents, more contexts and paths are added 
and updated, thus building a deeper, more complete CT. The rec- 
ommendation for an activity or context in a sequence is generated 
using Eq. 2 continuously as experts and weights are updated. At 
the same time, a pool of candidate items is maintained through a 
dynamically evolving context tree. As new items are added, new 
branches are created. At the same time, nodes corresponding to old 
items are removed as soon as they disappear from the current pool. 
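The process described above can be put together in a compact end-to-end sketch. It is a simplified illustration under several assumptions, not the authors' implementation: the class and method names are hypothetical, new nodes start with an assumed weight of 0.5, the root weight is fixed to 1, all suffixes up to `max_depth` are created on update, and the removal of nodes for items that leave the candidate pool is omitted.

```python
from collections import Counter
from typing import Dict, List


class CTNode:
    def __init__(self) -> None:
        self.children: Dict[str, "CTNode"] = {}
        self.counts = Counter()  # expert counts (Eq. 1)
        self.weight = 0.5        # w_i; the initial value is an assumption


class ContextTreeRecommender:
    def __init__(self, max_depth: int = 3, alpha0: float = 1.0) -> None:
        self.root = CTNode()
        self.root.weight = 1.0   # root always contributes (assumption)
        self.max_depth = max_depth
        self.alpha0 = alpha0
        self.items = set()       # current candidate pool

    def _path(self, history: List[str], grow: bool) -> List[CTNode]:
        """Active path for `history`, root first; optionally create nodes."""
        path, node = [self.root], self.root
        for item in reversed(history[-self.max_depth:]):
            if item not in node.children:
                if not grow:
                    break
                node.children[item] = CTNode()
            node = node.children[item]
            path.append(node)
        return path

    def _prob(self, node: CTNode, item: str) -> float:
        total = sum(node.counts.values())
        return (node.counts[item] + self.alpha0) / (total + self.alpha0)

    def _score(self, path: List[CTNode], item: str) -> float:
        q = 0.0
        for node in path:  # Eq. 4 recursion along the active path
            q = node.weight * self._prob(node, item) + (1 - node.weight) * q
        return q

    def update(self, history: List[str], next_item: str) -> None:
        self.items.add(next_item)
        q = 0.0
        for node in self._path(history, grow=True):
            p = self._prob(node, next_item)
            q = node.weight * p + (1 - node.weight) * q
            node.weight = node.weight * p / q  # Eq. 5
            node.counts[next_item] += 1        # counts used by Eq. 1

    def recommend(self, history: List[str], k: int = 5) -> List[str]:
        path = self._path(history, grow=False)
        ranked = sorted(self.items, key=lambda x: (-self._score(path, x), x))
        return ranked[:k]


rec = ContextTreeRecommender()
sessions = [["a", "b", "c"], ["a", "b", "c"], ["d", "b", "e"]]
for seq in sessions:
    for t in range(len(seq)):
        rec.update(seq[:t], seq[t])
after_ab = rec.recommend(["a", "b"], k=1)
after_db = rec.recommend(["d", "b"], k=1)
print(after_ab, after_db)
```

In this toy run the last viewed item is the same in both histories, yet the longer contexts lead to different top recommendations, which is the behavior the variable-order mixture is meant to provide.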


The CT recommender is a mixture model. On the one hand, the
prediction $P(n_{t+1} = x \mid s_t)$ is a mixture of the predictions given
by all the activated experts along the activated path, so it is a
mixture of local experts, or a mixture of variable-order Markov
models whose orders are defined by context depths. On the other
hand, one path in a CT can be constructed or updated by multiple
users, so it is also a mixture of users’ preferences.


3.3 Adaptation Analysis

Our hypothesis, which is validated in later experiments, is that the 
CT recommender can be applied elegantly to domains where adap- 
tation and timeliness are of concern. Two properties of the CT 
methods are crucial to the goal. First, the model parameter learn- 
ing process and recommendations generated are online such that 
the model adapts continuously to a dynamic environment. Second, 
adaptability can be achieved by the CT structure itself, as
knowledge is organized and activated by context. New items or paths are
recognized in new contexts, whereas old items can still be accessed
in their old contexts. This allows the model to make predictions
using more complex contexts as more data is acquired, so that old and
new knowledge can be elegantly combined. New knowledge
or patterns added to an established CT can immediately be
identified through context matching. This context organization and
context matching mechanism helps new patterns to be recognized,
so the model adapts to changing environments.


3.4 Complexity Analysis 

Learning a CT uses the recursive update defined in Eq. 4, and
recommendations are generated by weighting the experts’ predictions
along the activated path as given by Eq. 2. For trees of depth $D$, the
time complexity of model learning and prediction for a new
observation are both $O(D)$. For an input sequence of length $T$, the
updating and recommending complexity are $O(M^2)$, where
$M = \min(D, T)$. Space complexity in the worst case is exponential in
the depth of the tree. However, as we do not generate branches
unless the sequence occurs in the input, we achieve a much lower
bound determined by the total size of the input. So the space
complexity is $O(N)$, where $N$ is the total number of observations.
Compared with the way that Markov models are learned, in which 
the whole transition matrix needs to be learned simultaneously, the 
space efficiency of CT offers us an advantage for model learning. 
For tasks that involve very long sequences, we can limit the depth 
D of the CT for space and time efficiency. 


4. DATASET AND PROBLEM ANALYSIS 
4.1 Dataset Description 


In this paper, we work on recommending discussion forum threads
to MOOC students. A forum thread can be updated frequently, and
it contains multiple posts and comments on a topic. As mentioned
before, the challenge is adapting to drifting user preferences and
evolving forum threads as a course progresses. For the experiments
elaborated in the following section, we use forum viewing data from
three courses offered by École Polytechnique Fédérale de Lausanne
on Coursera. These three courses are the first offering of “Digital
Signal Processing”, the third offering of “Functional Program Design
in Scala”, and the first offering of “Reactive Programming”. They are
referred to as Course 1, Course 2, and Course 3. Some discussion
forum statistics for the three courses are given in Table 1. From the
number of forum participants, forum threads, and thread views, we
can see that the course scale increases from Course 1 to Course 3.
A student often accesses course forums many times during the span
of a MOOC. Each time, the threads she views are tracked as one
visit session by the web browser. The total number of visit sessions
and the average session lengths for the three courses are presented
in Table 1. The length of a session is the number of threads viewed
within that visit session. The thread viewing sequences corresponding
to these regular visit sessions are called separated sequences in our
later experiments; they treat the threads in one visit session as one
sequence. Models built using separated sequences try to catch
short-term patterns within one visit session, and we do not
differentiate the patterns of different students. Another setting,
called combined sequences, concatenates all of a student’s visit
sessions into one longer sequence, so that models built using combined
sequences try to learn long-term patterns across students. The
average length of combined sequences is the average session length
times the average number of sessions per student. From Course 1
to Course 3, the average lengths of both separated and combined
sequences increase.
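The two sequence settings can be illustrated with a small sketch. The log format, a time-ordered list of `(student_id, session_id, thread_id)` rows, and all identifiers below are hypothetical and chosen only for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical log rows: (student_id, session_id, thread_id), in time order.
views: List[Tuple[str, int, str]] = [
    ("s1", 1, "t1"), ("s1", 1, "t2"),
    ("s2", 1, "t2"),
    ("s1", 2, "t3"), ("s1", 2, "t1"),
]


def separated_sequences(rows: List[Tuple[str, int, str]]) -> List[List[str]]:
    """One sequence per visit session."""
    sessions: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for student, session, thread in rows:
        sessions[(student, session)].append(thread)
    return list(sessions.values())


def combined_sequences(rows: List[Tuple[str, int, str]]) -> List[List[str]]:
    """One sequence per student: all of that student's sessions concatenated."""
    per_student: Dict[str, List[str]] = defaultdict(list)
    for student, _session, thread in rows:
        per_student[student].append(thread)
    return list(per_student.values())


separated = separated_sequences(views)
combined = combined_sequences(views)
print(separated)
print(combined)
```

Student `s1` contributes two short separated sequences but a single combined sequence of length four, which is why combined sequences are longer on average.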


                                 Course 1 | Course 2 | Course 3
# of forum participants          13,914 [remaining entries illegible]
# of forum threads               2,404 [remaining entries illegible]
# of thread views                130,093  | 379,456  | 777,304
# of sessions                    19,892   | 40,764   | 30,082
avg. session length              6.5      | 9        | 25.8
avg. # of sessions per student   3.1      | 3.3      | [illegible]


Table 1: Course forum statistics for three datasets.


Another important issue that we can discover from the statistics is
that the thread viewing data available for sequential recommendation
is very sparse. For example, in Course 1, the average session length is
6.5 and the number of threads is around 1116. The complete
space to be explored is then $1116^{6.5}$, which is much larger than
the number of observations (130,093 thread views). This data
sparsity issue is even more severe in the other two datasets.
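The size of this gap can be checked with a few lines of arithmetic, using only the Course 1 figures quoted above (the variable names are illustrative).

```python
import math

num_threads = 1116         # threads in Course 1 (from the text)
avg_session_length = 6.5   # average session length in Course 1
observations = 130_093     # thread views in Course 1

space = num_threads ** avg_session_length
print(f"sequence space ~ 10^{math.log10(space):.1f}")
print(f"space / observations ~ 10^{math.log10(space / observations):.1f}")
```

The space of possible sequences exceeds the number of observed thread views by more than fourteen orders of magnitude, so most contexts are never observed.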


4.2 Forum Thread View Pattern 


Next, we study the thread viewing pattern, which highlights the
significance of adaptation issues for thread recommendation. Figure
3 illustrates the distribution of thread views against freshness for
the three courses. The freshness of an item is defined as its relative
creation order among all items that have been created so far. For
example, when a student views a thread $t_m$, which is the $m$-th thread
created in the currently existing pool of $n$ threads, the freshness
of $t_m$ is defined as:

$$\text{freshness} = \frac{m}{n} \qquad (6)$$
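As a small worked example of the freshness definition $m / n$, the sketch below computes the freshness of a viewed thread from its position in the creation order. The function name and the thread identifiers are hypothetical.

```python
def freshness(creation_rank: int, pool_size: int) -> float:
    """Return m / n, where the viewed thread is the m-th created of n threads."""
    if not 1 <= creation_rank <= pool_size:
        raise ValueError("creation_rank must be between 1 and pool_size")
    return creation_rank / pool_size


created_order = ["t1", "t2", "t3", "t4"]  # oldest first, hypothetical ids
viewed = "t3"
value = freshness(created_order.index(viewed) + 1, len(created_order))
print(value)
```

The most recently created thread always has freshness 1, and the value of an existing thread decreases as newer threads are added to the pool.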


We can see from Figure 3 that there is a sharp trend: new forum
threads are viewed much more frequently than old ones in all three
courses. This is mainly because fresh threads are closely relevant
to the current course progress. Moreover, fresh threads can also
supersede the contents of some old ones. This tendency to view
fresh items leads to drifting user preferences. Such drifting
preferences, coupled with the evolving nature of forum contents,
require recommendations that adapt to drifting or recent preferences.


Figure 3: Thread viewing activities against freshness. The plot, titled
“Distribution of Thread Views against Freshness”, shows probability
(y-axis) against freshness (x-axis, 0 to 1) for Course 1, Course 2, and
Course 3.


A further investigation of the views on old threads leads us
to classify threads into two categories: general threads
and specific threads. Some titles of general and specific threads
are listed in Table 2. We can see a clear difference between
these two classes of threads: the general ones correspond to
broad topics, and the specific ones are related to detailed course
contents or exercises. We also found that only a very small part of the
old threads are still actively viewed, and they are mostly
general ones. Unlike general threads, specific threads, which
are time-sensitive, are viewed very few times after they get
old. In general, sequential patterns are observed more often within
specific threads, as some specific follow-up threads might be related
and useful to the one being viewed. So the patterns learned
could be used to guide the forum browsing process. On the
contrary, sequential patterns on general threads are relatively random
and hard to detect.


General Threads                 | Specific Threads
“Using GNU Octave”              | “Homework Day 1 / Question 9”
“Any one from INDIA??”          | “Quiz for module 4.2”
“Where is everyone from?”       | “quiz -1 Question 04”
“Numerical Examples in pdf”     | “Homework 3, Question 11”
“How to get a certificate”      | “Week 1: Q10 GEMA problem”


Table 2: Sample thread titles of general and specific threads.


5. RESULTS AND EVALUATION 


In this section, we compare the proposed CT method against
various baseline methods in both non-sequential and sequential
settings. The results show that the CT recommender performs better
than the other methods under different settings for all three MOOCs
considered. Through the adaptation analysis, we validate our
hypothesis that the superior performance of the CT recommender comes
from its ability to adapt to drifting preferences and trending
patterns in the domain. Finally, a regularization technique for CT,
called partial context matching (PCT), is introduced. We demonstrate
that PCT helps to generalize better among sequence patterns
and further boosts performance.


5.1 Baseline Methods 


5.1.1 Non-sequential Methods 

Matrix factorization methods proposed by [23, 22] are the state of
the art for MOOC course content recommendation. Besides the
user-based MF given in [23], we also consider item-based MF, which
generates recommendations based on the similarity of the latent
item features learned from standard MF. In our case, each entry in
the user-item matrix of MF contains the number of times a student
views a thread. We also tested a version where the matrix had a 1 for
any number of views, but the performance was not as good, so this
version was not developed any further. The MF models considered
here are updated periodically (week by week). To enable a fair
comparison against non-sequential matrix factorization techniques,
we implemented versions where the CT model is updated at fixed
time intervals, equal to those of the MF models. In the “One-shot CT”
version, we compute the CT recommendations for each user based on
the data available at the time of the model update, and the user then
receives these same recommendations at every future time step until
the next update. This mirrors the conditions of user-based MF. To
compare with item-based MF, the “Slow-update CT” version updates
the recommendations, but not the model, at each time point based on
the sequential forum viewing information available at that time.


5.1.2 Sequential Methods 


Figure 4: Overall performance comparison of CT and non-sequential methods


Sequential methods update model parameters and recommendations
continuously as items are consumed. The first two simple
methods are based on the observation and heuristic that fresh threads
are viewed much more frequently than old ones. Fresh_1 recommends
the last 5 updated threads, and Fresh_2 recommends the last 5
created threads. Another baseline method, referred to as Popular,
recommends the top 5 threads among the last 100 threads viewed before
the current one. We also consider an online version of MF [10] that
is currently the state-of-the-art sequential recommendation method,
referred to as “online-MF”, in which the latent factors of item $i$
and user $u$ are updated when a new observation $R_{ui}$ arrives.
The model optimization is implemented based on element-wise
Alternating Least Squares. The number of latent factors is
tuned to be 15, 20, and 25 for the three datasets, and the
regularization parameter is set to 0.01. Moreover, a new observation
is given the same weight as old ones during optimization, as this
achieves the best performance. Finally, the proposed CT recommender
refers to the full context tree algorithm with a continuously updated
model.
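The simple Fresh and Popular baselines described above can be sketched as follows. The function names and input formats are assumptions: `fresh_recommend` expects thread identifiers ordered oldest to newest (by last update for Fresh_1, by creation time for Fresh_2), and `popular_recommend` expects the global view log in time order.

```python
from collections import Counter
from typing import List


def fresh_recommend(threads_by_time: List[str], k: int = 5) -> List[str]:
    """Return the k most recent threads, newest first."""
    return threads_by_time[-k:][::-1]


def popular_recommend(view_log: List[str], k: int = 5, window: int = 100) -> List[str]:
    """Return the k most viewed threads among the last `window` views."""
    recent = view_log[-window:]
    return [thread for thread, _count in Counter(recent).most_common(k)]


newest = fresh_recommend(["t1", "t2", "t3"], k=2)
popular = popular_recommend(["a", "b", "a", "c", "b", "a"], k=2)
print(newest, popular)
```

Both baselines ignore the user's own history, which makes them useful reference points for how much of the performance comes from recency and popularity alone.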


5.2 Performance and Adaptation Analysis 


5.2.1 Evaluation Metrics 
In our case, all methods recommend the top 5 threads each time. Two
evaluation metrics are adopted in the following experiments:


• Succ@5: the mean average precision (MAP) of predicting
the immediately next thread view in a sequence.


• Succ@5Ahead: the MAP of predicting the future thread
views within a sequence. In this case, a recommendation
is successful even if it is viewed later in the sequence.
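One possible reading of the two metrics is sketched below. It is an assumption rather than the authors' evaluation code: each prediction point is scored with a 0/1 hit indicator (the next view is in the top 5 for Succ@5, any later view of the same sequence is in the top 5 for Succ@5Ahead), and the indicators are averaged over the sequence. The `recommend` callable and all names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def succ_at_k(recommended: Sequence[str], next_item: str) -> int:
    """1 if the immediately next view is among the recommendations."""
    return int(next_item in recommended)


def succ_at_k_ahead(recommended: Sequence[str], future_items: Sequence[str]) -> int:
    """1 if any later view in the sequence is among the recommendations."""
    return int(any(item in recommended for item in future_items))


def evaluate(
    sequence: List[str], recommend: Callable[[List[str]], List[str]], k: int = 5
) -> Tuple[float, float]:
    """Average both hit indicators over every prediction point in a sequence."""
    hits, hits_ahead, n = 0, 0, 0
    for t in range(1, len(sequence)):
        recs = recommend(sequence[:t])[:k]
        hits += succ_at_k(recs, sequence[t])
        hits_ahead += succ_at_k_ahead(recs, sequence[t:])
        n += 1
    return (hits / n, hits_ahead / n) if n else (0.0, 0.0)


# A toy recommender that always suggests thread "c".
succ, succ_ahead = evaluate(["a", "b", "c", "d"], lambda history: ["c"])
print(succ, succ_ahead)
```

By construction Succ@5Ahead is never lower than Succ@5, since the next view is itself one of the future views.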


5.2.2 Comparison of Non-sequential Methods 

Figure 4 shows the performance comparison between different
versions of methods based on MF and CT on the three datasets. “CT”
is the sequential method with a continuously updated model, and
all other methods in Figure 4 are non-sequential versions. Combined
sequences are used for the CT methods here to allow a parallel
comparison against MF. We found that a small depth limit for
the CTs hurts performance, whereas a very large depth limit does
not increase performance and costs more computation and memory.
Through experiments, we tune the depths empirically and set them to
15, 20, and 30 for the three datasets.


Among non-sequential methods, one-shot CT and user-based MF
perform the worst for all three courses, which means that
recommending the same content for the next week without any sequence
consideration is ineffective. Slow-update CT performs consistently
the best among non-sequential methods, which shows that adapting
recommendations through the context tree helps boost performance
even though the model itself is not updated continuously. Compared
to slow-update CT, item-based MF performs much worse. They
both update model parameters periodically, and the recommendations
are adjusted given the current observation. However, using
the contextual information within a sequence and the corresponding
prediction experts of slow-update CT is much more powerful
than just using the latent item features of item-based MF. Moreover, we
can clearly see that the normal CT with continuous updates
outperforms all non-sequential methods by a large margin on the three
datasets. This means that drifting preferences need to be followed
through continuous and adaptive model updates, so sequential
methods are better choices. Next, we focus on sequential methods, and
we validate our hypothesis that the CT model has superior
performance because it better handles drifting user preferences.


5.2.3 Comparison of Sequential Methods 

The results presented in Table 3 show the performance of the full
CT recommender compared with other sequential baseline methods
under different settings and evaluation metrics. Each result
tuple contains the performance on the three datasets. We also
consider a tail performance metric, referred to as personalized
evaluation, where the most popular threads (20, 30, and 40 for the three
courses) are excluded from recommendations. The depth limits of CTs
using separated sequences are set to 8, 10, and 15 for the three courses.


We notice that the online-MF method, with continuous model
updates, performs much worse than the CT recommender
on all three datasets. This result shows that matrix factorization,
which is based on interpolation over the user-item matrix, is not
sensitive enough to rapidly drifting preferences with limited
observations. The performance of the two versions of the Fresh
recommender is comparable with online-MF, and Fresh_1 even
outperforms online-MF in many cases, especially for Succ@5Ahead.
This means that simply recommending fresh items does a better
job than online-MF for this recommendation task with drifting
preferences. We can see that the CT recommender outperforms
all other sequential methods under various settings, except when
using non-personalized Succ@5Ahead for Course 2. The Popular
recommender is indeed a very strong contender under
non-personalized evaluation, since there is a bias: students can click a
“top threads” tag in the user interface to view popular threads, which
are similar to the ones given by the Popular recommender. From the
educational perspective, the setting using separated sequences and
personalized evaluation is the most interesting, as it reflects
short-term visiting patterns within a session over specific and less
popular forum threads. We can see from the upper right part of
Table 3 that the CT recommender outperforms all other methods by
a large margin under this setting.
online-MF | [9, 8, 7]%    | [34, 27, 23]% | [7, 6, 6]% | [29, 24, 20]%
Popular   | [13, 14, 14]% | [52, 62, 58]% | [9, 8, 7]% | [45, 36, 43]%
Fresh_1   | [10, 12, 9]%  | [48, 44, 44]% | [8, 9, 8]% | [44, 34, 42]%
Fresh_2   | [7, 6, 6]%    | [43, 34, 32]% | [6, 6, 6]% | [42, 32, 31]%


Table 3: Performance comparison of sequential methods


5.2.4 Adaptation Comparison 
Figure 5: Distribution of recommendation freshness of CT and online-MF

Figure 6: Conditional success rate of CT and online-MF


After seeing the superior performance of the CT recommender, we
move to a deeper analysis of the results. To be specific, we
compare CT and online-MF in terms of their ability to adapt
to new items. Figure 5 illustrates the cumulative distribution
function (CDF) of the threads recommended by different methods against
thread freshness. We can see that the CDFs of CT increase sharply
as thread freshness increases, which means that the probability
of recommending fresh items is high compared to online-MF. In
other words, CT recommends more fresh items than online-MF. As
mentioned before, a large portion of fresh threads are specific
ones rather than general ones, so CT recommends more specific
and trending threads to students, while methods based on matrix
factorization recommend more popular and general threads.


Besides the quantity of fresh and specific threads recommended,
the quality is crucial as well. Figure 6 shows the conditional
success rate $P(\text{Success} \mid \text{Freshness})$ across different degrees of
freshness for the three courses. $P(\text{Success} \mid \text{Freshness})$ is defined as the
fraction of the items successfully recommended given the item
freshness. For instance, if an item with freshness 0.5 is viewed 100 times
throughout a course, then $P(\text{Success} \mid \text{Freshness} = 0.5) = 0.25$
means that it is among the top 5 recommended items 25 times. As
the freshness increases, the conditional success rate of online-MF
drops quickly, while the CT method keeps a solid and stable
performance. Notably, CT outperforms online-MF by a
large margin when freshness is high; in other words, it is
particularly strong at recommending fresh items. Fresh items are often
not popular in terms of the total number of views at the time
of recommendation. So identifying fresh items accurately implies
a strong ability to adapt to new and evolving forum visiting
patterns. The analysis above validates our hypothesis that the CT
recommender can adapt well to drifting user preferences. Another
conclusion drawn from Figure 6 is that the performance of CT is as
good as that of online-MF for items with low freshness. This is because
the context organization and context matching mechanism helps
old items to remain identifiable through old contexts. To conclude, CT
is flexible at combining old knowledge and new knowledge, so it
performs well for items of various freshness, especially for
fresh ones with drifting preferences.
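The conditional success rate discussed above can be estimated from logged evaluation events as sketched below. Grouping freshness into ten equal-width bins is an assumption made here for illustration, as are the function name and the event format: one pair per thread view, holding the freshness of the viewed thread and whether it was among the top-5 recommendations.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def conditional_success_rate(
    events: Iterable[Tuple[float, bool]], num_bins: int = 10
) -> Dict[int, float]:
    """Estimate P(Success | Freshness) for each freshness bin."""
    views = defaultdict(int)
    hits = defaultdict(int)
    for fresh, success in events:
        b = min(int(fresh * num_bins), num_bins - 1)  # freshness 1.0 -> last bin
        views[b] += 1
        hits[b] += int(success)
    return {b: hits[b] / views[b] for b in sorted(views)}


# Mirrors the example in the text: 100 views at freshness 0.5, 25 of them hits.
events = [(0.5, True)] * 25 + [(0.5, False)] * 75 + [(1.0, True)]
rates = conditional_success_rate(events)
print(rates)
```

Plotting these per-bin rates against freshness for each method gives a curve of the kind shown in Figure 6.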


5.3 Partial Context Matching (PCT)


Finally, we introduce another technique, built on top of the
standard CT, to generalize to new sequence patterns and further boost
the recommendation performance. The standard CT recommender
adopts a complete context matching mechanism to identify active
experts for a sequence $s$. That is, the active experts of $s$ come exactly
from the set of suffixes of $s$. We design a partial context matching
(PCT) mechanism where the active experts of a sequence are not
constrained to exact suffixes but can also come from very similar
ones. Two reasons lead us to design the PCT mechanism for context
tree learning. First, the PCT mechanism is a way of adding
regularization. A sequential item consumption process does not have to
follow exactly the same order, and slightly different sequences are
also relevant for both model learning and recommendation
generation. Second, the data sparsity issue discussed before for the
sequential recommendation setting can be alleviated to some extent by
considering similar contexts when learning model experts. PCT
aims to activate more experts to train the model and to
generate recommendations from a mixture of similar contexts.


We focus on a skip operation that we add on top of the standard
CT recommender. More complex operations, such as swapping item
order, were also tested, but they did not yield better performance.
For a sequence $(s_p, \ldots, s_1)$ of length p, the skip operation
generates p candidate partially matched contexts, each of which skips
one $s_k$ for $k \in \{1, \ldots, p\}$. All the contexts on the paths
from the root to the partially matched contexts are activated. For
example, the path to context $(n_2, n_1)$ can be activated from the
context $(n_2, n_3, n_1)$ by skipping $n_3$. However, a partially
matched context may not have a fully matched path in the current
context tree. In this case, for each partially matched context, we
identify the longest path that corresponds to it, with length q. If
q/p is larger than some threshold t, we update the experts on this
path and use them to generate recommendations for the current
observation. Predictions from multiple paths are combined by
averaging the probabilities.
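
The skip operation and the q/p test can be illustrated with a minimal Python sketch. It is not the authors' implementation: the nested-dictionary tree, the function names, and the convention that a context is stored oldest item first (so it is matched from its last element back toward the root) are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Context = Tuple[str, ...]


def skip_candidates(context: Sequence[str]) -> List[Context]:
    """Return the p contexts obtained by skipping exactly one item."""
    items = tuple(context)
    return [items[:k] + items[k + 1:] for k in range(len(items))]


def longest_path_length(tree: Dict, context: Context) -> int:
    """Length q of the longest root path that matches `context`.

    Assumption: `tree` is a nested dict keyed by item, and a context is
    matched from its most recent item (last element) backwards.
    """
    node, depth = tree, 0
    for item in reversed(context):
        if item not in node:
            break
        node = node[item]
        depth += 1
    return depth


def active_partial_contexts(
    tree: Dict, context: Sequence[str], threshold: float
) -> List[Tuple[Context, int]]:
    """Keep the skip candidates whose matched path satisfies q/p > threshold."""
    p = len(context)
    active = []
    for candidate in skip_candidates(context):
        q = longest_path_length(tree, candidate)
        if p > 0 and q / p > threshold:
            active.append((candidate, q))
    return active


# Hypothetical tree that only contains the path root -> n1 -> n2.
example_tree = {"n1": {"n2": {}}}
example = active_partial_contexts(example_tree, ("n2", "n3", "n1"), 0.5)
# example == [(("n2", "n1"), 2)]
```

With t = 0.5 and the three-item context from the example in the text, only the candidate that skips n3 has a long enough path in this toy tree (q/p = 2/3).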


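[Table 4 body: only fragments survive extraction. Legible cells, each a
bracketed triple of percentage changes: [+0.5, +0.8, …]%, […, +1.3, +0.5]%,
[+0.7, +0.9, +0.5]%, [+1.6, +1.9, +0.7]%, [+0.8, +1.1, +0.6]%,
[+1.0, +1.4, +0.7]%. Row labels, column headers, and the remaining cells
are not recoverable.]
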
Table 4: Performance comparison of PCT against CT for three courses 


Table 4 shows the performance of applying PCT for both model
update and recommendation with threshold t (PCT-t). Results are
compared with the full CT recommender with separated sequences
and non-personalized evaluation. When the threshold is
smaller than 0.5, we sometimes obtain negative results because
partially matched contexts are too short to be relevant. The “Ratio”
column is the ratio of the number of updated paths in PCT to that
in standard CT. PCT updates more paths and offers consistent
performance boosts at the cost of additional computation.


6. CONCLUSION AND FUTURE WORK 


In this paper, we formulate the MOOC forum recommendation
problem as a sequential decision problem. Our experimental
analysis shows that a new method called context tree achieves both
a performance boost and adaptation to drifting preferences.
Furthermore, we study a partial context matching mechanism that
allows a mixture of different but similar paths. As future work,
exploratory algorithms are worth trying. Because exploring all options
for all contexts is not feasible, we plan to explore only the top
options from similar contexts. Deploying the CT recommender in
some MOOCs for online evaluation would be valuable for obtaining
a more realistic evaluation.


ABSTRACT


Problem-solving skills in creative, open-ended domains are both 
important and little understood. These domains are generally ill- 
structured, have extremely large exploration spaces, and require 
high levels of specialized skill in order to produce quality solutions. 
We investigate problem-solving behavior in one such domain, the 
scientific-discovery game Foldit. Our goal is to discover differentiat- 
ing patterns and understand what distinguishes high and low levels 
of problem-solving skill. To address the challenges posed by the 
scale, complexity, and ill-structuredness of Foldit solver behavior 
data, we devise an iterative visualization-based methodology and use 
this methodology to design a concise, meaning-rich visualization of 
the problem-solving process in Foldit. We use this visualization to 
identify key patterns in problem-solving approaches, and report how 
these patterns distinguish high-performing solvers in this domain. 


Keywords 


Problem Solving; Scientific-Discovery Games; Visualization 


1. INTRODUCTION


As efforts in scalable online education expand, interest continues 
to increase in moving beyond small, highly constrained tasks, such 
as multiple choice or short answer questions, and incorporating 
creative, open-ended activities [7, 14]. Existing research supports 
this move, showing that problem-based learning can enhance stu- 
dents’ problem-solving and metacognitive skills [11]. Scaling such 
activities poses significant challenges, however, in terms of both as- 
sessment and feedback. It will be vital to devise scalable techniques 
not only to assess students’ final products, but also to understand 
their progress through complex and heterogeneous problem-solving 
spaces. These techniques will apply to a broad range of education 
settings, from purely online programs like Udacity’s Nanodegrees 
to more traditional settings where new standards like the Common 
Core emphasize strategic problem solving. 


A growing body of work has found that educational and serious 
games are fertile ground for assessing students’ capabilities and 
problem-solving skills [6, 10]. Our work continues this general 
line of inquiry by examining creative, problem-solving behavior 
among players in the scientific-discovery game Foldit. By modeling 
the functions of proteins, the workhorses of living cells, Foldit 
challenges players, hereafter referred to as solvers, to resolve the 
shape of proteins as a 3D puzzle. These puzzles are completely 
open and often under-specified, making it a highly suitable setting 
in which to gain insight into student progress through complex 
solution spaces. In the Foldit scientific-discovery community, the 
focus is on developing people from novices to experts that are 
eventually capable of solving protein structure problems that are
currently unsolved by the scientific community. In fact, solutions
produced in Foldit have led to three results published in Nature [3, 
5, 16]. Foldit is an attractive learning space domain because its 
solvers are capable of contributing to state-of-the-art biochemistry 
results, and the vast majority of the best-performing solvers had no
exposure to biochemistry prior to joining the Foldit community. Hence,
solver behavior in Foldit represents development of highly effective 
problem-solving in an open-ended domain over long time horizons. 
In this work, we identify six strategic patterns employed by Foldit 
solvers and show how these patterns differentiate between successful 
and less successful solvers. These patterns cover instances where 
solvers investigate multiple hypotheses, explore more greedily or 
more inquisitively, try to escape local optima, and make structured 
use of the manual or automated tools available in Foldit. 


The aspects of the Foldit environment that make it an attractive 
setting in which to study problem solving also present significant 
challenges. Problems in Foldit share many of the properties Jonassen 
attributes to design problems, which they describe as “among the 
most complex and ill-structured kinds of problems that are encoun- 
tered in practice” [13]. These properties include a vague goal with 
few constraints (in Foldit, the goal is often entirely open-ended: 
find a good configuration of the protein), answers that are neither 
right nor wrong, only better or worse, and limited feedback (in Foldit,
real-time feedback and solution evaluation are limited to a single 
numerical score corresponding to the protein’s current energy state, 
and solvers frequently must progress through many low-scoring 
states to reach a good configuration; more nuanced feedback from 
biochemists is sometimes available, but on a timescale of weeks). 
The ill-structured nature of problems posed in Foldit necessarily 
deprives us of the structures, such as clear goal states and straight- 
forward relationships between intermediate states and goal states, 
that typically form the basis of existing detailed and quantitative 
analyses of problem-solving behavior. 


The size and complexity of Foldit’s problem space present another
major challenge. Even though the logs of solver interactions consist 
only of regular snapshots of a solver’s current solution (along with 
attendant metadata), the record of a single solver’s performance on 
a given problem frequently consists of thousands of such snapshots 
(which in turn are just a sparse sampling of the actual solving pro- 
cess). Furthermore, the nature of the solution state, the configuration 
of hundreds of components in continuous three-dimensional space, 
renders collapsing the state space by directly comparing solution 
states impractical. Compounding the size of the problem space is 
the complexity of the actions available to Foldit solvers. In addition 
to manual manipulation of the protein configuration, solvers can 
invoke various low-level automated optimization routines (some
of which run until the solver terminates them) and place different
kinds of constraints on the protein configuration (rubber bands in 
Foldit parlance) that restrict its modification in a variety of ways. 
Solvers can also deploy many of these tools programmatically via 
Lua scripts called recipes. Taken together, these challenges of
ill-structuredness, size, and complexity threaten to make analysis of
high-level problem-solving behavior in Foldit intractable. 


To overcome these obstacles, we devise a visualization-based
methodology capable of producing tractable representations of Foldit
solvers’ problem-solving behavior while maintaining the key encodings
necessary for analysis of high-level strategic behavior. A process of
iterative summarization forms the core of this methodology, and
ensures that the transformations applied to the raw data do not
elide structures potentially relevant to understanding solvers’ unique
strategic behavior. Using this methodology, we examine solver
activity logs from 11 Foldit puzzles, representing 970 distinct solvers
and nearly 3 million solution snapshots. Leveraging metadata present
in the solution snapshots, we represent solving behavior as a tree,
and apply our methodology to visualize a summarized tree showing
where solvers branched off to investigate multiple hypotheses, how
they employed some of the automated tools available to them, and
other salient problem-solving behavior. We use these depictions to
determine key distinguishing features of this exploration process.
We subsequently use these features to better understand the patterns
of expert-level problem solving.


Our work focuses on the following research questions: (1) how 
can we visually represent an open-ended exploration towards a 
high-quality solution in a large, ill-structured problem space? (2) 
what are the key patterns of problem-solving behavior exhibited 
by individuals? and (3) what are the key differences along these
patterns between high-performing and lower-performing solvers in
an open-ended domain like Foldit? In addressing these questions, we
find that high-performing solvers explore the solution space more
broadly. In particular, they pursue more hypotheses and actively
avoid getting stuck in local minima. We also find that high- and
lower-performing solvers use similar proportions of manual and
automated tool actions, indicating that better performance on
open-ended challenges stems from the quality of the action intermixing
rather than aggregate quantity.


2. RELATED WORK 


While automated grading has mostly been explored for well-specified 
tasks where the correct answer has a straightforward and concise 
description, some previous work has developed techniques for more 
complex activities. Some achieve scalability through a crowd- 
sourcing framework such as Udacity’s system for hiring external 
experts as project reviewers [14]. Other work has demonstrated 
automated approaches that leverage machine learning to enable scal- 
able grading of more complex assignments. For example, Geigle et 
al. describe an application of online active learning to minimize the 
training set a human grader must produce [7] when automatically 
grading an assignment where students must analyze medical cases. 
Our work does not focus on grading problem-solving behavior, but 
instead approaches the issue of scalability at a more fundamental 
level: understanding fine-grained problem-solving strategies and 
how they contribute to success in an open-ended domain. 


A robust body of prior work has addressed the challenge of both 
visualizing and gleaning insight from player activity in educational 
and serious games. Andersen et al. developed Playtracer, a general
method for visualizing players’ progress through a game’s
state space when a spatial relationship between the player and the
virtual environment is not available [1]. Wallner and Kriglstein
provide a thorough review of visualization-based analysis of gameplay
data [21]. Prior work has analyzed gameplay data without visual- 
ization as well. Falakmasir et al. propose a data analysis pipeline 
for modeling player behavior in educational games. This system 
can produce a simple, interpretable model of in-game actions that 
can predict learning outcomes [6]. Our work differs in its aims from 
this prior work. We do not seek to develop a general visualization 
technique, but instead to design and leverage a domain-specific 
visualization to analyze problem-solving behavior. We are also 
not predicting player behavior, nor modeling players in terms of 
low-level actions, but rather identifying higher-level strategy use. 


The work most similar to ours is that which focuses on problem- 
solving behavior, including both the long-running efforts in educational
psychology to develop general theories and more recent
data-driven work on understanding the problem-solving process.
Our formulation of solving behavior in Foldit as a search through 
a problem space follows from classic information-processing the- 
ories of problem solving (e.g., [9, 19]). Gick reviews research on 
both problem-solving strategies and the differences in strategy use 
between experts and novices [8]. Our work complements the ex- 
isting literature by focusing on understanding problem solving in 
the little-studied domain of scientific-discovery games, and on the 
ill-structured problems present in Foldit. Our findings on the differ- 
ences in strategy use between high- and lower-performing solvers in 
Foldit are consistent with the consensus in the literature that experts’
knowledge allows them to effectively use strategies that are poorly 
or infrequently used by less-skilled solvers. We also contribute a 
granular understanding of the specific strategies and differences at 
work in the Foldit domain. 


Significant recent work has investigated problem-solving behavior 
in educational games and intelligent tutoring systems using a variety 
of techniques. Toth et al. used clustering to characterize problem- 
solving behavior on tasks related to understanding a system of linear 
structural equations. The clusters distinguished between students 
that used a vary-one-thing-at-a-time strategy (both more and less 
efficiently) and those that used other strategies [20]. Through a 
combination of automated detectors, path analysis, and classroom 
studies, Rowe et al. investigated the relationship between a set 
of six strategic moves in a Newtonian physics simulation game 
and performance on pre- and post-assessments. They found that 
the use of some moves mediated the relationship between prior 
achievement and post scores [18]. Eagle et al. discuss several ap- 
plications of using interaction networks to visualize and categorize 
problem-solving behavior in education games and intelligent tu- 
toring systems. These networks offer insight for hint generation 
and a flexible method for visualizing student work in rule-using 
problem solving environments [4]. Using decision trees to build
separate models for optimal and non-optimal student performance, 
Malkiewich et al. gained insight into how learning environments 
can encourage elegant problem solving [17]. Our primary contri- 
bution is to extend analysis of problem-solving behavior to a more 
complex and open-ended domain than those studied in similar previous
work. The size and complexity of Foldit’s problem space,
the volume of data necessary to capture exploration in this space, 
and the ill-structured nature of the Foldit problems all pose unique 
challenges. We devise a visualization-based methodology focused 
on iterative summarization, and successfully apply it to identify key 
problem-solving patterns exhibited by Foldit solvers. 


3. FOLDIT


Foldit is a scientific-discovery game that crowdsources protein fold- 
ing. It presents solvers with a 3D representation of a protein and 
tasks them with manipulating it into the lowest energy configura- 
tion. Each protein posed to the solvers is called a puzzle. Solvers’ 
solutions to each puzzle are scored according to their energy config- 
uration, and solvers compete to produce the highest scoring results. 


Figure 1: The Foldit interface. Foldit solvers use a variety of 
tools to interactively reshape proteins. In this figure, a solver 
uses rubber bands to pull together two sheets, long flat regions 
of the protein. 


Solvers have many tools at their disposal when solving Foldit puzzles.
They can manipulate and constrain the structure in various
ways, employ low-level automated optimization (e.g., a wiggle tool 
makes small, rapid, local adjustments to try and improve the score), 
and trigger solver-created automated scripts called recipes that can 
programmatically use the other tools. There is, however, a subset of 
the basic actions that cannot be used by recipes. We will call these 
manual-only actions. Previous work analyzing solver behavior in 
Foldit has focused primarily on recipe use and dissemination [2] and 
recipe authoring [15]. 


Foldit has several different types of puzzles for solvers to solve. In 
this work, we focus on the most common type of puzzle, prediction 
puzzles. These are puzzles in which biochemists know the amino 
acids that compose the protein in question, but do not know how 
the particular protein folds up in 3D space. This is in contrast to 
design puzzles, in which solvers insert and delete the amino acids that
compose the protein to satisfy a variety of scientific goals, including
designing new materials and targeting problematic molecules in
diseases. We focus on prediction puzzles in this work to simplify
our analysis by having a consistent objective (i.e., maximize score)
across the problem-solving behavior we analyze. 


4. METHODOLOGY 


Prior work has demonstrated the power of visualization to support 
understanding of problem-solving behavior (e.g., [12]). Hence, we 
devise a methodology capable of producing concise, meaning-rich 
visualizations of the problem-solving process in Foldit, and then 
leverage these visualizations to identify key patterns of solver be- 
havior. We are specifically interested in how solvers navigate from 
a puzzle’s start state to a high-quality solution, what states they 
pass through in between, and what other avenues they explored. 


Since solving a Foldit puzzle can be represented as a directed search
through a problem space, the clear encoding of parent-child
relationships between nodes offered by a tree makes it well suited to
visualizing these aspects of the solving process.


The scale of the Foldit data necessitates significant transformation 
of the raw data in order to render concise visualizations. Without 
any transformation, meaningful patterns are overwhelmed by sparse, 
repetitive data and would be far more challenging to identify. While 
there are many existing techniques for large-scale tree visualization, 
we find clear benefits to developing a visualization tailored to the 
Foldit domain. Specifically, preserving the semantics of our visual 
encoding is crucial for allowing us to connect patterns in the visual- 
ization to concrete strategic behavior in Foldit. To accomplish this, 
the process by which concise visualizations are constructed must
be carefully designed to maintain these links. Hence, we devise a 
design methodology focused on iterative summarization. 


This process begins by visualizing the raw data. This is followed 
by iteratively building and refining a set of transformations to summarize
the raw data while preserving meaning. The design of these
transformations should be guided by frequently occurring structures,
that is, structures that the transformations can condense without
eliding structures corresponding to unique strategic behavior.
In parallel with this iterative design, a set of visual encodings is
developed to represent the solving process as richly as possible. Key
to this entire process is frequent consultation with domain experts, 
in our case experts on Foldit and its community. By applying this 
iterative methodology for several cycles, we designed a domain- 
specific visualization that we use to identify patterns of strategic 
behavior among Foldit solvers. We follow up on these patterns with 
computational investigation, and quantify their application by high- 
and lower-performing solvers. 


4.1 Data 


For our analysis, we selected 11 prediction puzzles spanning the 
range of time for which the necessary data is available. Though 
Foldit has been in continuous use since 2010, the data necessary to 
track a solver’s progress through the problem space has only been 
collected since mid-2015. Our chosen dataset represents 970 unique 
solvers and nearly 3 million solution snapshots. These 11 puzzles are 
just a small subset of the available Foldit data. We chose a subset of 
similar puzzles (i.e., a subtype of relatively less complex prediction
puzzles) in order to make common solving-behavior patterns easier 
to identify. The size of the subset was also guided by practical 
constraints, as each puzzle constitutes a large amount of data (20-60 
GB for the data from all players on a single puzzle). 


The data logged by Foldit primarily consists of snapshots of solver 
solutions as they play, stored as text files using the Protein Data 
Bank (pdb) format. These snapshots include the current protein 
pose, a timestamp, the solution’s score, the number of times the 
solver has invoked each action and recipe, and a record of the inter- 
mediate states that led up to the solution at the time of the snapshot. 
This record, or solution history, is a list of unique identifiers each 
corresponding to a previous solution state. This list is extended 
every time the solver undoes an action or reloads a previous solution. 
Hence, by comparing the histories of two snapshots from the same 
solver, we can answer questions about their relationship (e.g., does
one snapshot represent the predecessor of another; where did two 
related snapshots diverge). The key relationship for the purposes of 
this analysis is the direct parent-child relationship, which we use to 
generate trees that represent a solver’s solving process. 
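
The history comparisons described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Foldit's actual data handling: a history is taken to be a list of state identifiers ordered from the puzzle's start state to the snapshot's own state, and all function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def common_prefix_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading state identifiers shared by two histories."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def is_predecessor(a: Sequence[str], b: Sequence[str]) -> bool:
    """True if history `a` is a proper prefix of history `b`."""
    return len(a) < len(b) and common_prefix_length(a, b) == len(a)


def divergence_point(a: Sequence[str], b: Sequence[str]) -> Optional[str]:
    """Last state shared by both histories, or None if they share none."""
    n = common_prefix_length(a, b)
    return a[n - 1] if n else None


def parent_child_edges(history: Sequence[str]) -> List[Tuple[str, str]]:
    """Direct parent-child pairs implied by one history."""
    return list(zip(history, history[1:]))
```

Taking the union of `parent_child_edges` over all of a solver's snapshots yields the edge set of that solver's solution tree, under the ordering assumption above.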


4.2 Visualizing Solution Trees

We applied our methodology to our chosen subset of Foldit data to 
design a visualization of an individual’s problem-solving process 
as a solution tree. Several key principles guided this design. First, 
since our goal is to discover key patterns, the visualization needs 
to highlight distinctly different strategies and approaches. These 
differences cannot be buried amidst enormous structures, nor de- 
stroyed by graph transformations. Second, the visualization must 
depict the closeness of each step to the ultimate solution in both time 
and quality to give a sense of the solver’s progression. Third, the 
solver’s use of automation in the form of recipes should be apparent 
since the use of automation is an important part of Foldit. 


The fundamental organization of the visualization is that each node 
corresponds to a solution state encountered while solving. Using the 
solution history present in the logged snapshots of solver solutions, 
we establish parent-child relationships between solutions. If solution 
β is a child of solution α, it indicates that β was generated when
the solver performed actions on α. One crucial limitation, however,
is that a snapshot of the solver’s current solution is captured far less 
often (only once every two minutes) than the solver takes actions. 
This means that our data is sparsely distributed along a solution’s 
history going back to the puzzle’s starting state. Hence, when naively 
constructing the tree from the logged solution histories, it ends up 
dominated by vast quantities of nodes with no associated data. 


We address this issue by performing summarization on the solution 
trees, condensing them into concise representations amenable to 
analysis for important features. This summarization takes place 
in two stages. The first stage trims out nodes that (1) do not have 
corresponding data and (2) have zero children. This eliminates 
large numbers of leaf nodes that we are unable to reason about 
given that we lack the corresponding data. This stage also combines 
sequences of nodes each with only one child into a single node. For 
the median tree, this stage reduced the number of nodes by an order 
of magnitude from over 12,000 nodes to about 1,600. 
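
A minimal Python sketch of this first stage follows. The `Node` class and function names are hypothetical, and two details are assumptions rather than statements from the text: trimming is applied bottom-up (so an empty node whose children are all trimmed is trimmed too), and a chain is read conservatively as consecutive nodes that each have exactly one child. The recursion is kept for brevity; trees with very long chains would need an iterative traversal.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    ident: str
    has_data: bool = False  # False marks a node with no logged snapshot
    children: List["Node"] = field(default_factory=list)
    members: List[str] = field(default_factory=list)  # ids folded into this node


def trim_empty_leaves(node: Node) -> Node:
    """Drop nodes that have no data and no (remaining) children."""
    kept = []
    for child in node.children:
        trim_empty_leaves(child)
        if child.has_data or child.children:
            kept.append(child)
    node.children = kept
    return node


def collapse_chains(node: Node) -> Node:
    """Fold a single child into its parent while that child also has one child."""
    while len(node.children) == 1 and len(node.children[0].children) == 1:
        child = node.children[0]
        node.members.append(child.ident)
        node.members.extend(child.members)
        node.has_data = node.has_data or child.has_data
        node.children = child.children
    for child in node.children:
        collapse_chains(child)
    return node


def count_nodes(node: Node) -> int:
    """Total number of nodes in the subtree rooted at `node`."""
    return 1 + sum(count_nodes(c) for c in node.children)


def summarize_stage_one(root: Node) -> Node:
    """Apply trimming, then chain collapsing, in place."""
    return collapse_chains(trim_empty_leaves(root))
```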


The second stage consists of four phases, each informed by our 
observations of common patterns in trees produced by the first stage 
that would benefit from summarization. The first phase, called 
prune, focuses on simplifying uninteresting branches. We observed 
many of the branches preserved by the first stage were small, with 
at most three children, and only continued the tree from one of 
those children. Prune removes the leaf children of these branches 
from the tree. Collapse, the second phase, transforms each of the 
sequences of single-child nodes left behind after prune into single 
nodes. The third phase, condense, targets another common pattern 
where a sequence of branches feed into each other, with a child of 
each branch the parent of the next branch. These sequences are 
summarized into a single node labeled CASCADE along with the 
depth (number of branches) and width (average branching factor) 
of the summarized branches. See Figure 2 for an example of the 
features summarized by these three phases. The final phase, clean, 
targets the ubiquitous empty nodes (i.e., nodes for which we lack
associated data) shown in black in Figure 2. We eliminate them by 
merging them with their parent node, doing so repeatedly until they 
all have been merged into nodes that contain data. In addition to 
making the trees more concise, this step allows us to reason more 
fully over the trees since all nodes are guaranteed to contain data. 
This second stage of summarization further reduced the number of 
nodes in the median tree by another order of magnitude to about 
300 nodes. Summarization similarly reduces the space required to 
store the data by two orders of magnitude. 
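
As an illustration of the final phase only, the sketch below merges empty nodes into their nearest ancestor that contains data. The `Node` class, the `members` bookkeeping, and the assumption that the root itself contains data are hypothetical choices for this example; the prune, collapse, and condense phases are not shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    ident: str
    has_data: bool = False  # False marks an empty node
    children: List["Node"] = field(default_factory=list)
    members: List[str] = field(default_factory=list)  # ids merged into this node


def clean(node: Node) -> Node:
    """Merge empty descendants upward until every remaining node has data.

    An empty child is replaced by its own children, which take its place
    in left-to-right order. Assumes `node` itself contains data.
    """
    new_children = []
    stack = list(reversed(node.children))
    while stack:
        child = stack.pop()
        if child.has_data:
            new_children.append(clean(child))
        else:
            node.members.append(child.ident)
            stack.extend(reversed(child.children))
    node.children = new_children
    return node


def all_nodes_have_data(node: Node) -> bool:
    """Check the guarantee that the clean phase is meant to establish."""
    return node.has_data and all(all_nodes_have_data(c) for c in node.children)
```

Replacing an empty node by its children in place keeps siblings in their original order, which matters if child order carries meaning in the visualization.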


Figure 2: A solution tree after only the first stage of summa- 
rization. The non-black node color represents the score of the 
solution at that node (red is worse). The black nodes are empty 
in that we do not have solution data corresponding to that node. 
This figure also shows examples of the features targeted by the 
second summarization stage: prune and collapse eliminate long 
chains like the one on the right, and condense combines sequences
of branches like those going down to the left into single
CASCADE nodes.


Child-parent relationships are not the only part of the data we visu- 
ally encoded in the solution trees. Nodes are colored on a continuous 
gradient from red to blue according to the score of the solution rep- 
resented by that node (red is low-scoring, blue is high-scoring). The 
best-scoring node is highlighted as a yellow star. Edges are colored 
on a continuous gradient from light to dark green according to the 
time the corresponding transition took place, and the children of 
each node are arranged left to right in chronological order. Finally, 
use of automation via recipes is an important aspect of problem- 
solving in Foldit. Since the logged solution snapshots contain a 
record of which recipes have been used at that point, we can use this 
to annotate nodes where a recipe was triggered. The annotations 
consist of the ID of that recipe (a 4- to 6-digit number) and the number
of times it was started. 


One major weakness in the data available to us is the lack of a con- 
sistent way to determine when the execution of a recipe ended (some 
recipes save and restore, possibly being responsible for multiple 
nodes in the graph beyond where they were triggered). We partially 
address this by further annotating a node with the label MANUAL 
whenever the solver took a manual-only action at that node. This 
indicates that no previously triggered recipe continued past that node 
because no recipe could have performed the manual-only action. 
Since nodes in the summarized trees can represent many individual 
steps, it is possible for them to have several of these recipe and 
manual action annotations. 


5. RESULTS 


Using visualized solution trees for a large set of solvers across our 
sample of 11 puzzles, we identify a set of six prominent patterns in 
solvers’ problem-solving behavior. These patterns do not encompass 
all solving behavior in Foldit, but instead capture key instances of 
strategic behavior in three categories: exploration, optimization, and 
human-computer collaboration. Future work is needed to generate 
a comprehensive survey of the strategic patterns in these and other 
categories. In this analysis, our focus is on identifying a small, 
diverse set of commonly occurring patterns to both provide initial 
insight into problem-solving behavior, and to demonstrate the potential
of our approach. In addition to identification, we also perform
a quantitative comparison of how these patterns are employed by
high-performing and lower-performing solvers to gain an understanding
of how these patterns contribute to success in an open-ended
environment like Foldit.


5.1 Problem-Solving Patterns 


Exploration. Foldit solvers are confronted with a highly discon- 
tinuous solution space with many local optima, creating a trade-off 
between narrowly focusing their efforts or taking the time to explore 
a broader range of possibilities. In our first two patterns, we exam- 
ine the broader exploration side of this trade-off at two different 
scales. Taking the macro-scale first, we identify a pattern where 
solvers make significant progress on distinct branches of the tree 
(see Figure 3 for an example). We interpret this pattern as the solver 
investigating multiple hypotheses about the puzzle solution, using 
multiple instances of the game client or Foldit’s save and restore fea- 
tures to deeply explore them all. We call this the multiple hypotheses 
pattern. 


Figure 3: An example of the multiple hypotheses pattern. The
two hypotheses branch out from one of the nodes at the top and
continue to the left (A) and right (B).


At the micro-scale, solvers very frequently generate a large number 
of possible next steps (i.e., a branch with a large number of children), 
but most often proceed to explore only one of them further. This is 
natural given the iterative refinement needed to successfully partici- 
pate in Foldit. Hence, solvers that exhibit a pattern of much more 
frequently exploring multiple local possibilities demonstrate an un- 
usual effort to explore more broadly. We call this the inquisitive 
pattern. Figure 4 shows an example of this behavior.


Figure 4: An example of the inquisitive pattern. Note how frequently
multiple children of the same node are explored when
compared to the tree in Figure 3.


Optimization. Navigating the extremely heterogeneous solution 
space is the primary challenge in Foldit, so we look closely at how 
solvers attempt to optimize their solutions, digging deeper into 
solvers’ approach to exploration than the previous two patterns. 
We identify two related patterns describing solvers’ fine-grained 
approach to optimization. The solution spaces of Foldit puzzles 
contain numerous local optima that solvers must escape, and we
identify an optima escape pattern highly suggestive of a deliberate
attempt to escape a local optimum. This pattern occurs when a solver
has a high-scoring node with a low-scoring child, and then chooses
to explore from the low-scoring child. The solver was willing to
ignore the short-term drop in score to try to reach a more beneficial
state in the long term. Figure 5 gives an example of this pattern.


Figure 5: An example of the optima escape pattern. The solver
transitions from a relatively high-scoring (i.e., blue) state in the
upper left to a low-scoring (i.e., red) state. What makes this
an example of the pattern is the continued exploration from the
low-scoring state. In this case, the perseverance paid off as the
solver reaches even higher-scoring states in the lower right.


In the other direction, we identify the greedy pattern in which solvers 
exclusively explore from the best-scoring of the available options. 
Obviously, some amount of greedy exploration is necessary in order 
to refine solutions, but in its extreme form it deserves recognition
as a pattern with significant potential impact on problem-solving 
success. Naturally, these two patterns do not cover all the ways 
solvers explore the problem space, but they do characterize specific 
strategic behavior of interest in this analysis.


Figure 6: An example of the repeated recipe pattern. At three
points in this solution tree snippet, the solver applies recipe
49233 to every child of a node.


Human-computer collaboration. Human-computer collabo- 
ration is a vital part of Foldit, and managing the trade-off between 
automation and manual intervention is a key feature of solving 
Foldit puzzles. We identify two patterns that each focus on one 
side of this trade-off. The first, the manual pattern, corresponds to 
extended sections of exclusively manual exploration. Since recipe 
use is very common, extended manual exploration represents a
significant investment in the manual intervention side of the trade-off.
Limitations with Foldit logging data prevent us from capturing all
the manual exploration (i.e., it is not always possible to determine
whether an action was performed by a solver manually or triggered 
as part of an automated recipe), but what can be captured is still an 
important dimension of variance among problem-solving behavior. 


Our final pattern concerns recipe use. Some solvers apply a recipe 
to every child of a node periodically throughout their solution tree, 
using it as a clean-up or refinement step before continuing on (see 
Figure 6). We call this the repeated recipe pattern. Recipe use is 
very diverse and frequently doesn’t display any specific structure, 
making this pattern interesting for its regimented way of managing 
some of the automation while solving.


Figure 7: The number of hypotheses pursued in each solution
tree for high- and lower-performing solvers. High-performing 
solvers frequently pursue two or more hypotheses, whereas 
lower-performing solvers most often pursue just one. Red cir- 
cles show the distribution of individual solvers. 


5.2 Problem-Solving Patterns and Solver Performance

To understand how the patterns we identify relate to skillful problem- 
solving in an open-ended domain like Foldit, we compare their use 
among high-performing solvers to that among lower-performing 
solvers. Specifically, we analyze the occurrence of these patterns in 
the 15 best-scoring solutions from each puzzle and compare that to 
the occurrence in solutions from each puzzle ranked from 36th to 
50th. Though it varies somewhat between puzzles, in general the 
solutions ranked 36th to 50th represent a middle ground in terms 
of quality. They fall outside the puzzle’s state-of-the-art solutions, 
but remain well above the least successful efforts. Throughout these 
comparisons we use non-parametric Mann-Whitney U tests with
a significance level of α = 0.008 (Bonferroni correction for six comparisons,
α = 0.05/6), as our data is not normally distributed. For each test,
we report the test statistic U, the two-tailed significance p, and the 
rank-biserial correlation measure of effect size r. In addition, since 
some of the metrics we compute may not apply to all solution trees 
(e.g., the tree contains no branches where the inquisitive pattern 
can be evaluated), we report the number of solvers involved in the 
comparison n for each test (the full sample is n = 330). 
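

To make the reported quantities concrete, the following is a minimal sketch of how U, the rank-biserial correlation r, and the Bonferroni-corrected threshold can be computed. It is an illustration under stated assumptions, not the analysis code used in this work: the function names and sample values are hypothetical, and U is taken here as the count of pairs in which the first sample exceeds the second (ties counted as one half), which may differ from the convention behind the reported values.

```python
def mann_whitney_u(x, y):
    """U for sample x: number of (x_i, y_j) pairs with x_i > y_j, ties count 0.5."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def rank_biserial(x, y):
    """Rank-biserial correlation r = 2U / (n_x * n_y) - 1, ranging over [-1, 1]."""
    return 2.0 * mann_whitney_u(x, y) / (len(x) * len(y)) - 1.0


def bonferroni_alpha(alpha, n_comparisons):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_comparisons


# Made-up numbers of hypotheses per solution tree (NOT the paper's data).
high = [3, 2, 2, 4, 1]
lower = [1, 1, 2, 1, 3]

u = mann_whitney_u(high, lower)      # 18.0
r = rank_biserial(high, lower)       # 0.44
alpha = bonferroni_alpha(0.05, 6)    # roughly 0.008
```

A positive r indicates that values in the first sample tend to exceed those in the second. The two-tailed p-value is not computed here; in practice it would typically come from a statistics library routine such as `scipy.stats.mannwhitneyu`.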


We find high-performing solvers explore more broadly than lower- 
performing solvers. For the multiple hypotheses pattern, high- 
performing solvers pursued significantly more hypotheses than 
lower-performing solvers (U = 10569, p = 0.000014, r = 0.217, 
n = 330) (see Figure 7). For the inquisitive pattern, we compute
the proportion of each solver’s exploration that matches the pattern
(i.e., of all the branches in a solver’s solution tree, in what fraction
of them did the solver explore more than one child) and find
high-performing solvers explore inquisitively more often than
lower-performing solvers (U = 9343, p = 0.000295, r = 0.231, n = 313)
(see Figure 8).


Figure 8: The proportion of all the branches in a solver’s
solution tree in which the solver explored more than one child
for high- and lower-performing solvers. Red circles show the
distribution of individual solvers.
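

The inquisitive proportion described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation. It assumes a hypothetical tree encoding (a dict mapping each node to its list of children), treats a "branch" as a node with at least two children, and treats a child as "explored" when it has children of its own.

```python
def explored_children(tree, node):
    """Children of `node` that were themselves expanded (have children)."""
    return [c for c in tree.get(node, []) if tree.get(c)]


def inquisitive_proportion(tree):
    """Fraction of branches in which more than one child was explored.

    Returns None when the tree has no branches, i.e., when the metric
    cannot be evaluated for this solution tree.
    """
    branches = [n for n, kids in tree.items() if len(kids) >= 2]
    if not branches:
        return None
    hits = sum(1 for n in branches if len(explored_children(tree, n)) > 1)
    return hits / len(branches)


# Hypothetical solution tree: node -> children (leaves are simply absent as keys).
tree = {
    "root": ["a", "b", "c"],  # branch: both a and b are explored further
    "a": ["a1", "a2"],        # branch: only a1 is explored further
    "b": ["b1"],
    "a1": ["a1x"],
}
proportion = inquisitive_proportion(tree)  # 1 of 2 branches -> 0.5
```

Returning None for trees without branches corresponds to the reduced n reported for this comparison, since such trees cannot contribute a proportion.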


We also find high-performing solvers work harder to avoid local 
optima. For the optima escape pattern, we compute the num- 
ber of times this behavior occurs in each solution and find that 
high-performing solvers engage in this behavior more than
lower-performing solvers (U = 11183.5, p = 0.00185, r = 0.173, n = 330)
(see Figure 9). For the greedy pattern, we compute the proportion
of each solver’s exploration that matches the pattern (i.e., of
all the branches in a solver’s solution tree, in what fraction of
them did the solver only explore the best-scoring child). While
high-performing solvers engaged in greedy optimization less often
than lower-performing solvers, the difference was not significant
at the corrected level
(U = 9079, p = 0.0158, r = -0.163, n = 295) (see Figure 10).
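

Both optimization metrics can be expressed over a scored solution tree. The sketch below is illustrative only and rests on assumptions not stated in the text: the dict-based tree encoding, the `min_drop` threshold that decides when a child counts as "low-scoring" relative to its parent, and the rule that a child is "explored" when it has children of its own are all hypothetical choices.

```python
def optima_escapes(tree, score, min_drop):
    """Count parent->child steps where the score drops by at least
    `min_drop` and the solver still explores from the child."""
    count = 0
    for parent, kids in tree.items():
        for child in kids:
            dropped = score[parent] - score[child] >= min_drop
            if dropped and tree.get(child):
                count += 1
    return count


def greedy_proportion(tree, score):
    """Fraction of branches where only the best-scoring child was explored."""
    evaluated = greedy = 0
    for parent, kids in tree.items():
        explored = [c for c in kids if tree.get(c)]
        if len(kids) < 2 or not explored:
            continue  # not a branch, or nothing explored from it
        evaluated += 1
        best = max(kids, key=lambda c: score[c])
        if explored == [best]:
            greedy += 1
    return greedy / evaluated if evaluated else None


# Hypothetical tree and scores (higher is better).
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"], "a2": ["a2x"]}
score = {"root": 100, "a": 120, "b": 60, "a1": 80, "a2": 90,
         "b1": 140, "a2x": 150}

escapes = optima_escapes(tree, score, min_drop=20)  # root->b and a->a2
greedy = greedy_proportion(tree, score)             # 1 of 2 branches
```

Note that the step from a to a2 counts toward both metrics in this toy tree: a2 is the best available child, yet it still scores below its parent.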


Finally, we find no significant difference between high- and
lower-performing solvers in how frequently they explore manually and
employ recipes. For the manual pattern, we compute the number of
manual exploration sections in each solution and find no significant
difference between high- and lower-performing solvers (U = 13334,
p = 0.789, r = 0.014, n = 330). For the repeated recipe pattern,
we compute the median frequency of recipe use along all paths
in the solution (i.e., for each path from the root to a leaf, in what
fraction of the nodes did the solver trigger at least one recipe) and
though lower-performing solvers used recipes more frequently, the
difference between high- and lower-performing solvers was not
significant (U = 11342, p = 0.0140, r = -0.157, n = 329).
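

The recipe-frequency metric can be illustrated with a short sketch. It assumes a hypothetical encoding in which the tree is a dict of node to children and `used_recipe` is the set of nodes where at least one recipe was triggered; neither name comes from the Foldit data.

```python
from statistics import median


def root_to_leaf_paths(tree, root):
    """Yield every path (list of nodes) from `root` to a leaf."""
    stack = [[root]]
    while stack:
        path = stack.pop()
        kids = tree.get(path[-1], [])
        if not kids:
            yield path
        for child in kids:
            stack.append(path + [child])


def median_recipe_frequency(tree, root, used_recipe):
    """Median, over root-to-leaf paths, of the fraction of nodes on the
    path where at least one recipe was triggered."""
    fractions = [
        sum(1 for node in path if node in used_recipe) / len(path)
        for path in root_to_leaf_paths(tree, root)
    ]
    return median(fractions)


# Hypothetical tree and recipe usage.
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
used_recipe = {"a", "a1", "b1"}

# Path fractions are 2/3 (root-a-a1), 1/3 (root-a-a2), 1/3 (root-b-b1).
freq = median_recipe_frequency(tree, "root", used_recipe)
```

Taking the median across paths keeps a single recipe-heavy branch from dominating the per-solution value.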


6. DISCUSSION 


The results from our analysis of our solution tree visualizations
illuminate some key problem-solving patterns exhibited by individual
Foldit solvers: namely, how broadly an individual explores, both
on a macro- and micro-scale; how actively an individual avoids
local optima by engaging in less greedy optimization and actively
pursuing locally suboptimal lines of inquiry; and how an individual
manages the interplay between automation and manual intervention.


Figure 9: The number of times in each solution a solver
engages in optima escape behavior for high- and lower-performing
solvers. Red circles show the distribution of individual solvers.


Comparing high- and lower-performing solvers in their application
of these patterns suggests that skillful problem-solving in an
open-ended domain like Foldit involves broader exploration and more
conscious avoidance of local optima. This finding, that a key feature
of high-skill solving behavior is not being enamored of the current
best solution and possessing strategies for avoiding myopic thinking,
has implications for the strategies that should be taught to develop
successful problem solvers. Further work is required on other large
open-ended domains to confirm this trend.


The finding that solvers of different skill use greedy exploration, 
manual exploration, and automation in similar amounts suggests 
skillful deployment of non-greedy exploration, automation, and 
manual intervention takes place at a more fine-grained level than 
overall quantity. Though this work focuses on the presence or 
absence of specific solving behavior, the timing and sequencing of 
strategic moves are likely to be critical to success. Further work is 
needed to investigate what differentiates effective and ineffective 
use of specific solving strategies. 


The Foldit dataset itself presented significant challenges for our 
analysis, and we addressed these through an iterative visualization- 
based methodology. This process served as a design method for 
generating a visual grammar to describe a complex problem-solving 
process. We do not study the generalization of this approach to 
other datasets and domains in this work, but the prerequisites for 
its application to other open-ended problem-solving domains can 
be concisely enumerated: (1) the logs of solver activity establish 
clear temporal relationships between solution states such that those 
states can be visualized as a progression through the solution space,
(2) the solution state or associated metadata is amenable to visual
encoding, so that the visualized progressions can represent
fine-grained details of the solving process, and (3) deep problem-solving
domain expertise is available to provide the necessary context for
interpreting and summarizing the visualized structures.


Figure 10: The proportion of all the branches in a solver’s
solution tree in which the solver explored only the best-scoring
child for high- and lower-performing solvers. The fact that the
median for both categories of solver is above 0.5 indicates that
this pattern is an important part of refining solutions in Foldit.
Red circles show the distribution of individual solvers.


Our chosen subset of Foldit data represents only a small fraction 
of the total available data. In particular, we limited our analysis 
to a sample of similar prediction puzzles, and compared specific 
ranges of high- and lower-performing solvers. Though these choices 
are well-motivated, an important question for future work is
whether our results hold across different datasets and comparison
groups. More broadly, Foldit supports numerous variations
on the prediction and design puzzle archetypes, which offers an 
exciting opportunity to study problem solving across a number of 
related contexts with varying goals, constraints, inputs, and tools. 


7. CONCLUSION 


Gaining a better understanding of key patterns in problem-solving 
behavior in complex, open-ended environments is important for de- 
ploying this kind of activity in an educational setting at scale. In this 
work, we identified six key patterns in problem-solving behavior 
among solvers of Foldit. The protein folding challenges in Foldit 
present rich, completely open, heterogeneous solution spaces, mak- 
ing them a compelling domain in which to analyze these patterns. 
To facilitate the identification of these patterns, we used an iterative 
methodology to design visualizations of solvers’ problem-solving 
activity as solution trees. The size and complexity of the Foldit data 
required us to develop domain-specific techniques to summarize the 
solution trees and render them tractable for analysis while preserving
the salient problem-solving behaviors. Finally, we compared the
occurrence of the patterns we identified between high- and
lower-performing solvers. We found that high-performing solvers explore
more broadly and more aggressively avoid local optima. We also 
found that both categories of solvers employ automation and manual 
intervention in similar quantities, inviting future work to study how 
these tools are used at a more fine-grained level. 


We have only scratched the surface in our analysis of a subset of 
Foldit data. Two integral aspects of the Foldit environment are 
not within the scope of this work: collaboration and expert
feedback. We only considered solutions produced by individual solvers,
but Foldit solvers can also take solutions produced by others and
try to improve them. This collaborative framework may involve
specialization and unique solving strategies, and deserves careful
study. Expert feedback comes into play for design puzzles, where
biochemists will select a small number of the solutions to try to
synthesize in the lab. Experts will also impose additional constraints
on future design puzzles to try to guide solutions toward more
promising designs. The interaction of these channels for expert 
feedback and problem-solving behavior is an important topic for 
future research. Also outside the scope of this work is how individ- 
ual solvers change their problem-solving behavior over time. Many 
solvers have been participating in the Foldit community for many 
years, and studying how their behavior evolves could yield insights 
into the acquisition of high-level problem-solving skills. 


Looking more broadly at the impact of this work, our methodology 
and analysis can serve as a first step toward discovering the scaffold- 
ing necessary to develop high-level problem-solving skills. These 
results could contribute to a hint generation system, where solvers 
could be guided toward known effective strategies, or a meta-planner 
component in Foldit that could tailor the parameters of particular 
puzzles to optimize the quality of the scientific results. In all of 
these cases, this work contributes to the necessary foundational 
understanding of the problem-solving behavior involved.


ABSTRACT


Replayability has long been touted as a benefit of educa- 
tional games. However, little research has measured its im- 
pact on learning, or investigated when students choose to re- 
play prior content. In this study, we analyzed data on a sam- 
ple of 4,827 3rd-5th graders from ST Math, a game-based ed- 
ucational platform integrated into classroom instruction in 
over 3,000 classrooms across the U.S. We identified features 
that describe elective replays relative to prior gameplay per- 
formance, and associated elective replays with in-game accu- 
racy, confidence, and general math ability assessments out- 
side of the games. We found some elective replay patterns 
were associated with learning, whereas others indicated that 
students were struggling in the current educational content. 
We suggest, therefore, that educational games should use 
elective replay behaviors to target interventions according 
to when and whether replay is helpful for learning. 


Keywords 
Educational Games, Serious Game Analytics, Replayability 


1. INTRODUCTION 


“Replayability is an important component of successful games” [15].
In most games, there are two types of plays: play and
replay to pass a level (pass attempts) and replay after pass- 
ing a level (elective replay). In this paper, we investigate 
the latter. Elective replay (ER) is particularly interesting 
because the motivations behind a student’s decision to re- 
play and the impact of those replays are relatively unknown. 
This paper explores potential associations between elective 
replay and student characteristics and performance in the 
domain of educational games. 


Replayability has been touted as a benefit of educational 
games [9]. Replayability encourages players to engage in
repeated judgement-behavior-feedback loops, where users
make decisions based on the situation and/or feedback, act
on those decisions, and receive feedback based on their
actions [18]. In the RETAIN model designed by Gunter et al.
[10] to evaluate educational games, replayability is a
criterion for naturalization, an important component in helping
students make their knowledge automatic, reducing the
cognitive load of low-level details to allow for higher order
thinking. In the RETAIN model, “replay is encouraged to assist
in retention and to remediate shortcomings” [10].
Meaningful elective replay is often encouraged by game features
such as score leaderboards, which inspire students to re- 
play for higher scores [4]. Because higher scores typically 
require a deeper understanding of the educational content 
in a well-designed game, encouraging elective replay may 
promote mastery. Games with replay also allow the stu- 
dent to be exposed to more material and give them more 
freedom to control their learning. Studies have shown that 
giving students control over their learning process can in- 
crease motivation, engagement, and performance [6, 8]. 


However, few studies have investigated when students choose 
to replay, why they do so, or have measured the outcomes as- 
sociated with elective replay. One reason is that educational 
game studies are often comparatively brief, so replayability 
is often minimally assessed with post-game questionnaires 
asking about students’ intention for future play [14, 5]. Con- 
sequently, there is a need to investigate elective replay with 
actual logged actions in a game setting where students have 
sufficient time and freedom to replay. 


This work analyzed gameplay logs from a series of math 
games within the year-long supplemental digital mathemat- 
ics curriculum Spatial Temporal (ST) Math. We analyzed 
gameplay data from 4,827 3rd-5th graders throughout the 
2012-2013 school year. Our data contained 37,452 logged 
elective replays, accounting for 1.48% of the logged play. 
We analyzed gameplay and elective replay features in as- 
sociation with students’ demographic information, in-game 
math objective tests, and the state standardized math test. 
We sought to answer three research questions. Q1: What are
the characteristics of students who engage in elective replay?
Q2: What gets replayed, and under what circumstances?
Q3: Is elective replay associated with improvements
in students’ accuracy on math objectives, confidence, and
general math ability?


2. RELATED WORK 
2.1 Factors Influencing Elective Replay 


Few empirical studies have investigated the motivations be- 
hind elective replay in educational games. Burger et al. [5] 
studied the effect of verbal feedback from a virtual agent 
on replay in the context of a brain-training game. They 
found that elaborated feedback increases, whereas compar- 
ative feedback decreases, the students’ interest in future re- 
play. They also found that negative feedback generated an 
immediate interest in replay, whereas positive feedback
created long-term interest in the educational content. In
another study, Plass et al. [14] compared three conditions in a
math game: working individually, competing with another
player, or collaborating with a peer. The study showed that
both competition and collaboration modes heightened stu- 
dents’ intention to replay when compared with the individ- 
ual mode, with the latter result being statistically signifi- 
cant. However, both studies measured replay via question- 
naires asking the students’ desire to play the entire game 
again instead of observed replay behavior. Moreover, these 
studies sought to understand replay only from the angle of 
game design, and did not address the connections, if any, 
between student characteristics and interest in replay. 


Other studies suggest elective replay is a habitual behavior 
that arises from individual need, although these studies did 
not directly investigate replay. Bartle [3] found one type 
of player who is primarily motivated by concrete measure- 
ments of success. In ST Math, these achiever-type players 
may largely use replay to get better ‘scores’ (losing fewer
lives when passing a level). Mostow et al. [12] observed 
a student in a reading tutor who used the learner-control 
features to spend the majority of time replaying stories or 
writing “junk” stories instead of progressing to new mate- 
rial. Thus, some students may also use replay as a form 
of work avoidance — playing already passed levels instead 
of solving the current problem or moving on. Sabourin et 
al. [17] found that students in an educational game used 
off-task behaviors to cope with frustration, implying that 
off-task behavior can be a productive self-regulation of neg- 
ative emotions. In ST Math, when students get frustrated 
with the current educational content but still have to play 
the game in the classroom, they may replay already learned 
content as a mental break from the current task. These 
studies showed that the circumstances of replay and stu- 
dents’ characteristics influence their decisions to replay and 
its outcomes. 


2.2 The Outcomes of Replay 

Despite the believed benefits of replayability [9, 18, 10, 4], 
few studies have investigated the educational impact of elec- 
tive replay. Boyce et al. [4] evaluated the effects of game 
elements that were designed to motivate gameplay and elec- 
tive replay. These included a leaderboard that shows each 
student’s rank based upon their score, a tool for creating 
custom puzzles, and a social system for messaging among 
players. The experimental design required students to play 
the game in one session, and to replay the game as more 
features were added in the subsequent sessions. The study 
found a sharp increase in test scores as these features were
added to the game. The authors concluded that features
designed to increase replayability can increase learning gains.
However, this result may be due to increased time on task, as
the same group replayed the base game with new features.
In another study, Clark et al. [7] analyzed logged student- 
initiated elective replay in a digital game. They found that 
frequency of elective replay did not correlate with learning 
gains, prior gaming habits/experience, or how much stu- 
dents liked the game. They also found that, while there 
was no statistically significant difference between the male 
and female students, males replayed more than the females. 
This may have been responsible for their slightly higher, al- 
though not statistically significant, “best level scores” — the 
highest score received on each level. These studies showed 
that elective replay may lead to increased learning or higher 
in-game performance. However, more research is needed to 
understand the potential educational impact of replay in ed- 
ucational games, particularly elective replays initiated solely 
by the players. 


3. GAME, DATA AND FEATURES 
3.1 ST Math Game 


[Figure 1 image: ST Math objective list, including Multiplication Concepts, Division Concepts, Multiplication and Division Situations, Multiplication and Division Relationships, and Concepts of Area and Perimeter, each paired with a Test Drive.]

Figure 1: ST Math Content and Examples


ST Math is designed to act as a supplemental program to 
a school’s existing mathematics curriculum. ST Math is 
mostly played during classroom sessions, but students have 
the option to play it at home. In ST Math [16], mathematics 
concepts are taught through spatial puzzles within various 
game-like arenas. ST Math games are structured at the top 
level by objectives, which are broad learning topics. Within 
each objective, individual games teach more targeted con- 
cepts through presentation of puzzles, which are grouped 
into levels for students to play. Students start by complet- 
ing a series of training games on the use of the ST Math 
platform and features. They are then guided to complete 
the first available objective in their grade-level curriculum, 
such as “Multiplication Concepts.” Students can only see 
this objective and must complete a pre-test before beginning 
the content. Games represent scenarios for problem-solving 
using a particular mathematical concept, such as “finding 
the right number of boots for X animals of Y legs.” Each 
game contains between one and ten levels, which follow the 
same general structure of the game, but increase in difficulty. 
Figure 1 illustrates the hierarchy of ST Math content and 
examples. 


As with many games, the student is given a set number of 
‘lives’ at the start of each level. Every time they fail to
complete a puzzle correctly they lose one life. If all of their
lives for a given level are exhausted, they will fail the level 
and be required to restart the level with a new set of lives. 
Once a student has passed a level, they can elect to replay 
it at any time. After a student has passed every level in an 
objective, they can take the objective post-test. Students 
cannot progress to the next objective until they have com- 
pleted the last objective post-test. Both the objective pre- 
and post-tests consist of 5-10 multiple choice questions re- 
lated to the objective. The post-tests parallel the pre-tests 
in both the question format and difficulty of the content. 
While answering each question in both tests, students indi- 
cate their relative confidence in their answer (low/high). 


3.2 Data 

MIND Research Institute (MIND), the developer of ST Math,
collected and provided to the researchers gameplay
data from 4,827 3rd-5th graders during the school year 2012- 
2013. These students came from 17 schools and 221 class- 
rooms. Table 1 summarizes students’ demographic informa- 
tion. These demographic data, together with students’ state 
standardized test scores in 2012 and 2013, were matched to 
gameplay data through anonymized IDs. 


Table 1: Population Demographic Information

                             Grade 3      Grade 4      Grade 5
#Students                    1567         1528         1732
[row label illegible]        50.6%        50.1%        52.2%
                             na: 2.9%     na: 2.0%     na: 3.5%
Eligible for Reduced Lunch   80.7%        77.8%        81.4%
                             na: 2.9%     na: 2.1%     na: 3.2%
Hispanic or Latino           [illegible]  [illegible]  [illegible]
                             na: 2.8%     na: 1.9%     na: 3.1%
English Language Learner     66.2%        56.1%        53.0%
                             na: 2.9%     na: 2.1%     na: 3.2%
With Listed Disability       [illegible]  [illegible]  [illegible]


This gameplay data includes pre- and post-tests for each 
objective and the number of level attempts. For each pre-
and post-test, ST Math logged students’ accuracy and
self-reported confidence level (1 for ‘high’ and 0 for ‘low’) for
each question. For each play at a level, ST Math logged the
student’s ID, timestamp, and the number of puzzles com- 
pleted. From these data, we identified ER as plays made 
after a student initially passed the level. We found ERs in 
89.6% of all objectives in ST Math, accounting for 1.48% 
of all level attempts. Among 4,827 students, 59.85% ERed 
at least one level, with an average of 7.84 levels (SD=12.99, 
95% CI [7.37, 8.32]) across 3.06 average objectives replayed 
per student. In the next section, we describe the features 
we created to analyze ER. 
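

As an illustration of the ER definition above, the following sketch labels each logged play as a pass attempt or an elective replay. It is a minimal example under assumptions: the tuple layout `(student, level, passed)` and the chronological ordering of the list are hypothetical stand-ins for the logged student ID, timestamp, and play outcome.

```python
def label_plays(plays):
    """Label each play as 'Pass Attempt' or 'ER'.

    `plays` is a chronological list of (student, level, passed) tuples.
    A play is an ER when the same student has already passed that level.
    """
    passed_levels = set()
    labels = []
    for student, level, passed in plays:
        key = (student, level)
        if key in passed_levels:
            labels.append("ER")
        else:
            labels.append("Pass Attempt")
            if passed:
                passed_levels.add(key)
    return labels


# Hypothetical log excerpt (not ST Math data).
plays = [
    ("s1", "Division-Level1", False),
    ("s1", "Division-Level1", True),   # first pass of this level
    ("s1", "Division-Level2", False),
    ("s1", "Division-Level1", True),   # replay after passing -> ER
    ("s2", "Division-Level1", True),   # different student -> pass attempt
]
labels = label_plays(plays)
er_share = labels.count("ER") / len(labels)  # 1 of 5 plays
```

The share of ERs among all plays computed this way corresponds to the kind of percentage reported above for the full log.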


3.3 Features

We created features at three different levels of granularity
(from finest to coarsest): level, objective, and student. For
the level granularity, we treated each unique student-level
combination as an observation. We calculated the features
by averaging all gameplay for a specific student at a
specific level. For objective granularity, each unique
student-objective combination was treated as a single observation.
Features were created by averaging across all levels played by
a specific student within a single objective. The objective
granularity also included the objective pre- and post-test
accuracy and confidence. For the student granularity, we
treated each student as a single observation. We calculated
the features by averaging across all objectives played by a
student over the entire year. The student granularity also
included student demographic data and state standardized
math test scores. These granularities ensured that our
analysis did not favor units with the majority of data logs. Each
student was considered equally in our analysis, regardless
of how many objectives they played. Our data contained
4,827 students and 2,524,681 plays, which yielded 1,462,660
student-level observations and 74,985 student-objective
observations.
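

The three granularities amount to averaging in stages rather than pooling all plays. The sketch below illustrates this with made-up performance values. The `collapse` helper and the key layout `(student, objective, level)` are hypothetical and only meant to show why each student ends up contributing equally.

```python
from collections import defaultdict
from statistics import mean


def collapse(rows, key_len):
    """Average the values of rows that share the first `key_len` key fields."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key[:key_len]].append(value)
    return [(key, mean(values)) for key, values in groups.items()]


# (student, objective, level) -> performance of one play (hypothetical values).
plays = [
    (("s1", "Division", "L1"), 0.5),
    (("s1", "Division", "L1"), 1.0),
    (("s1", "Division", "L2"), 1.0),
    (("s1", "Fractions", "L1"), 0.5),
    (("s2", "Division", "L1"), 1.0),
]

by_level = collapse(plays, 3)             # student-level observations
by_objective = collapse(by_level, 2)      # student-objective observations
by_student = dict(collapse(by_objective, 1))

staged_mean = mean(by_student.values())   # each student weighted equally
pooled_mean = mean(v for _, v in plays)   # students with more plays dominate
```

In this toy data the staged mean (0.84375) differs from the pooled mean (0.8) because student s1 logged four of the five plays.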


Table 2 shows five example plays of “Division-Level3,” in- 
cluding four pass attempts and one ER of this level, inter- 
spersed with ERs from other levels. We consider consecutive 
ERs as an ER Session, because these ERs occur under the same
circumstances, that is, after the same pass attempt.


Table 2: Example of ER and Pass Attempts

Play  Objective-Level     Passed?  Play Type
1     Division-Level3     No       Pass Attempt
2     Division-Level3     No       Pass Attempt
3     Division-Level1     Yes      ER (ER Session 1)
4     Division-Level3     No       Pass Attempt
5     Division-Level1     Yes      ER (ER Session 2)
6     Division-Level3     Yes      Pass Attempt
7     Division-Level3     Yes      ER (ER Session 3)
8     Subtraction-Level1  No       ER (ER Session 3)
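

The grouping of consecutive ERs into sessions can be sketched directly on the plays of Table 2. This is an illustrative reading of the definition, not the authors' code. The function name and the `(level, play_type)` layout are assumptions.

```python
def er_sessions(plays):
    """Group consecutive ERs into sessions.

    `plays` is a chronological list of (level, play_type) pairs for one
    student. Returns a list of (prior_pass_attempt_level, replayed_levels),
    where the prior pass attempt is the last pass attempt before the session.
    """
    sessions = []
    last_pass_level = None
    in_session = False
    for level, play_type in plays:
        if play_type == "ER":
            if not in_session:
                sessions.append((last_pass_level, []))
                in_session = True
            sessions[-1][1].append(level)
        else:
            last_pass_level = level
            in_session = False
    return sessions


# The eight plays of Table 2.
table2 = [
    ("Division-Level3", "Pass Attempt"),
    ("Division-Level3", "Pass Attempt"),
    ("Division-Level1", "ER"),
    ("Division-Level3", "Pass Attempt"),
    ("Division-Level1", "ER"),
    ("Division-Level3", "Pass Attempt"),
    ("Division-Level3", "ER"),
    ("Subtraction-Level1", "ER"),
]
sessions = er_sessions(table2)
```

On this input the sketch yields three sessions, each preceded by a pass attempt on Division-Level3, with plays 7 and 8 merged into the third session.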


3.3.1 Pass Attempt Features 

We defined performance to be the percentage of puzzles a 
student completed before losing all lives on the level. Pass 
attempts are plays prior to ER, where we assumed stu- 
dents play with the intention of passing the level. Pass at- 
tempt features included: performance when a student first 
attempted a level (1st pass attempt performance), number of 
attempts taken to pass a level (# pass attempts), and aver- 
age performance of all pass attempts (average pass attempt 
performance). At the student granularity, students took an 
average of 1.91 (SD=0.89) attempts to pass each level, with
average performance of 0.80 (SD=0.10) on the first pass
attempt, and 0.87 (SD=0.07) on all pass attempts (indicating
overall improved performance on later attempts).
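

For a single student-level observation, the pass attempt features can be sketched as below. The example assumes that performance is the number of puzzles completed divided by the number of puzzles in the level. The parameter `puzzles_in_level` and the sample counts are hypothetical.

```python
def pass_attempt_features(puzzles_completed, puzzles_in_level):
    """Compute pass attempt features for one student on one level.

    `puzzles_completed` lists, in order, the number of puzzles completed
    on each pass attempt (the last attempt being the one that passed).
    """
    performances = [done / puzzles_in_level for done in puzzles_completed]
    return {
        "first_performance": performances[0],
        "n_pass_attempts": len(performances),
        "avg_performance": sum(performances) / len(performances),
    }


# Three attempts on a hypothetical 8-puzzle level: 4, 6, then all 8 puzzles.
features = pass_attempt_features([4, 6, 8], puzzles_in_level=8)
```

Here the first-attempt performance is 0.5 and the average over all three attempts is 0.75, the same direction of improvement as in the aggregate figures above.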


3.3.2 Elective Replay Features 

Table 3 shows ER features that describe ER from three an- 
gles: (I) the frequencies of ER, (II) the performance of ER, 
and (III) the circumstances of ER in terms of the ER’s prior 
plays. To summarize, the majority of ERs had higher
performance than their levels’ first attempt, and resulted in
another pass of their levels. Levels that were ERed had
similar performance compared to levels that weren’t ERed, but
levels that were followed (54.65%) or interrupted (54.35%)
by ER had much lower performance than those that weren’t 
followed or interrupted by ER. Most ERs’ immediately prior 
pass attempts were from different levels or objectives. There 
were few instances (9.80%) where students passed a level and 
immediately ERed it following the pass.


Table 3: Elective Replay (ER) Features and their Descriptive Statistics among Students who Electively Replayed, Collapsed to the Student Granularity.

ER Features | Descriptive Stats

I. Frequencies of ER
  % ER out of all plays | M=2.40%, SD=4.26%
  % Objectives that have been electively replayed | M=22.94%, SD=20.89%
  % Objectives whose pass attempts were interrupted/followed by ER | M=19.48%, SD=17.57%

II. Performance of ER
  Performance of ER | M=0.71, SD=0.28
  % ERs performed better than the level’s first attempt | M=71.96%, SD=31.44%
  % ERs that result in another pass of the level | M=60.36%, SD=35.51%

III. Circumstances of ER
  The Replayed Level (e.g., “Division-lvl1,” “Division-lvl3,” and “Subtraction-lvl1” in Table 2)
    Pass Attempts Features | M=0.79, 1.98, 0.87 for 1st performance, # pass attempts, and avg performance
  The Immediately-Prior Play of the ER (e.g., Play 2 is the immediately-prior play of Play 3 in Table 2)
    Performance on the immediately-prior play | M=0.63, SD=0.29
    % ERs whose immediately-prior play is also an ER | M=0.31, SD=0.28
    % ERs whose immediately prior pass attempt is on the same level | M=9.80%, SD=23.84%
    ... on a different level in the same objective | M=40.75%, SD=39.09%
    ... on a different objective | M=49.44%, SD=40.76%
  The Immediate Prior Pass Attempts followed or interrupted by ER and ER Session (e.g., “Division-lvl3” for all ER Sessions in Table 2)
    Pass Attempts Features | M=0.51, 3.62, 0.55 for 1st performance, # pass attempts, and avg performance
    % ER sessions whose prior pass attempt passed the level | M=45.65%, SD=40.69%

Note. Statistics are reported at the student granularity: they are calculated by averaging across all objectives played by a student, and then averaging across all students who electively replayed. This means each student contributes equally to the average, regardless of how many objectives s/he played.


3.3.3 Student Grouping From ER Features 

We created student groups to encapsulate the circumstances 
under which ER occurred, based on students’ majority ER 
and ER sessions. Based on prior literature, we hypothesized 
that ER is a habitual behavior that arises from individual 
needs, such as gaining higher scores [3], avoiding progress on 
the current task [12], or taking a mental break from negative
emotions [17]. Thus, grouping students by the circumstances
of their majority replay behavior provides high-level
profiles for investigating the characteristics of students
who engaged in ER and benefited from ER.


We characterized ER by the timing relative to the student’s 
current learning objectives and gameplay. The first group- 
ing describes whether the majority ER sessions started be- 
fore (Group B) or after (Group A) passing the previous at- 
tempted level (current learning objective). If there is a tie 
between the two types of replay session, the student belongs
to neither group. For example, Table 2 describes a
Group B student, who has two replay sessions before passing
“Division-Level3,” and one replay session after passing this
level but before moving on to the next level.


The second grouping describes whether an ER followed plays
on the same level (SL), a different level under the same
objective (DLSO), or a different objective (DO). For our
example in Table 2, the student’s pass attempts on
“Division-Level3” were interrupted twice, on the third and
fifth plays, by replays on “Division-Level1” (DLSO). After
passing “Division-Level3”, the student replayed the same
level (SL) once during the seventh play, and a different
objective, “Subtraction-Level1” (DO), once during the eighth
play. This Group B student had two DLSO replays, one SL
replay, and one DO replay. Thus, this student also belongs
to Group DLSO, because the
two groupings are independent of each other.
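

The two majority-based groupings can be illustrated with a small sketch. The helper name `majority_label` and the label encoding are assumptions for this example. The text specifies that a tie between A and B leaves the student in neither group; applying the same tie rule to SL/DLSO/DO is an assumption made here.

```python
from collections import Counter


def majority_label(labels):
    """Return the strictly most frequent label, or None on a tie or empty input."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    # A tie for first place means the student belongs to no group.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# The student from Table 2: two ER sessions before passing Division-Level3
# ("B") and one after ("A"); four ERs labeled by what they followed.
session_timing = ["B", "B", "A"]
er_context = ["DLSO", "DLSO", "SL", "DO"]

timing_group = majority_label(session_timing)
context_group = majority_label(er_context)
```

Because the two calls use separate label lists, a student can belong to one timing group and one context group at the same time.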


4. METHODS & RESULTS 
4.1 Who Engaged in Elective Replay? 


We first investigated the demographic characteristics of
students who engaged in elective replay. We found that males
did so more often than females (male: 63.2%, female: 57.0%,
χ²(1, N=4827) = 17.99, p<.001). We also found that English
Language Learners (ELL) did so more often than their
non-ELL peers (ELL: 62.3%, non-ELL: 57.1%, χ²(1, N=4827)
= 12.69, p<.001), as did students with reported
disabilities (disability: 68.7%, no disability: 59.1%, χ²(1, N=4827)
= 18.17, p<.001). There were no statistically
significant differences in the frequencies of ER based on race
when operationalized as Hispanic/non-Hispanic, or based
on free/reduced lunch eligibility. The frequency of ER was
not found to be correlated with other out-of-game student
factors, such as state standardized math test scores.


The frequency of ER was also not correlated with in-game
pre-test accuracy and confidence at the objective granularity.
Next, we investigated the gameplay characteristics of
students who electively replayed. We first separated
students into groups based on their replay patterns. The first
five columns of Table 4 show the results of Mann-Whitney U
tests with Benjamini-Hochberg correction comparing each
group’s in-game performance to that of the students who never
electively replayed any levels (the Base group). The last column
compares the averaged ER performance of each group to the
rest of the students who electively replayed.


Table 4: Mann-Whitney U Tests Comparing Gameplay Characteristics between ER Pattern Student Groups

Group (# students) | Pre-test Accuracy | Pre-test Confidence | Avg Pass Attempts’ Performance | Avg 1st Attempt Performance | # Pass Attempts | ER Performance
Base: No ER (N=1938) | M=0.61, SD=0.17 | M=0.75, SD=0.23 | M=0.88, SD=0.08 | M=0.81, SD=0.11 | M=1.82, SD=0.84 | NA
ER (N=2889) | *M=0.57, SD=0.17 | M=0.74, SD=0.24 | M=[illegible], SD=0.07 | *M=0.80, SD=0.10 | *M=1.92, SD=0.78 | M=0.72, SD=0.29
Group A (N=1114) | M=0.62, SD=0.16 | M=0.77, SD=0.22 | *M=0.90, SD=0.05 | *M=0.84, SD=0.08 | M=[illegible], SD=0.52 | M=[illegible], SD=0.27
Group B (N=1464) | M=0.52 [marker illegible], SD=0.17 | *M=0.72, SD=0.25 | *M=0.84, SD=0.07 | *M=0.75, SD=0.09 | M=[illegible], SD=1.09 | *M=0.67, SD=0.29
Group SL (N=173) | M=0.61, SD=0.17 | M=0.75, SD=0.23 | M=0.88, SD=0.07 | M=0.81, SD=0.09 | M=1.82, SD=0.81 | *M=0.84, SD=0.29
Group DLSO (N=983) | M=0.54, SD=0.18 | M=0.73, SD=0.24 | M=0.84, SD=0.08 | M=0.76, SD=0.10 | M=[illegible], SD=1.16 | M=0.67, SD=0.32
Group DO (N=1399) | *M=0.58, SD=0.16 | M=0.75, SD=0.23 | M=0.88, SD=0.06 | M=0.81, SD=0.08 | M=1.80, SD=0.71 | M=0.73, SD=0.26

Note. 1) Green and red indicate values statistically significantly higher and lower than the base class, with *p < .001, +p < .01. 2)
Group A, B: most ER sessions happened after (A) or before (B) passing the prior non-replay level. Group SL, DLSO, DO: most
ERs followed pass attempts on the same level (SL), a different level in the same objective (DLSO), or a different objective (DO).
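

The Benjamini-Hochberg correction applied to these tests can be sketched as follows. This is a generic, standard-library illustration of the step-up adjustment, not the authors' analysis script; the function name and the raw p-values are made up for the example, and the Mann-Whitney U tests themselves are assumed to have been run already.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values from five group-versus-base comparisons.
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(raw)
```

A comparison is then flagged as significant when its adjusted p-value falls below the chosen threshold (for example .001 or .01, the levels marked in Table 4).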


Compared to the base group, students for whom most re- 
plays happened before passing the prior non-replay level 
(Group B) and students for whom most replays followed a 
different level on the same objective (Group DLSO) started 
with significantly lower pre-test scores and did worse in
gameplay, as measured by the three pass attempt features
described in Section 3.3.1. For example, students in Group
B started with lower accuracy and confidence at pre-test,
took an average of 0.5 more attempts to pass a level, and had
lower performance on the 1st pass attempt and all pass
attempts (including the 1st). It seems that Group B students,
who replayed earlier levels before passing the current one,
had less prior knowledge and struggled more in the game.
By contrast, students in Group A, for whom most replay 
happened after passing the current level, did slightly bet- 
ter in gameplay compared to students who never electively 
replayed (the Base group). Because these students started 
with pre-test scores that were not statistically significantly 
different from the base group, their replay patterns are as- 
sociated with higher gameplay performance. 


4.2 What Gets Replayed, and When? 


Next, we studied what levels get replayed, and under what
circumstances. We used a decision tree classifier, which
allowed us to identify which factors are most important in
relation to ER. Our goal was not to find precise predictive
models, but to augment our understanding of performance
and its relationship to ER. We used R’s rpart package with
parameters minsplit=5% and cp=0.02 to build trees to classify
levels that were replayed from levels that were not replayed,
and levels whose pass attempts were interrupted or
followed by replay from levels that were not interrupted or
followed by replay. We randomly undersampled the majority
class (levels without replay, or levels that were not
interrupted or followed by replay), so that each class
represented half of the observations. We used pass attempt
features at the level granularity together with pre-test
results, objective, and demographic information to build our
trees. We used 10-fold cross validation to assess the trees’
accuracies.
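

The random undersampling step can be sketched as below. The authors worked in R with rpart; this Python sketch only illustrates the class-balancing idea, and the function name, the record layout, and the fixed seed are assumptions made for the example.

```python
import random


def undersample_majority(rows, label_of, seed=0):
    """Randomly drop majority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    positives = [r for r in rows if label_of(r)]
    negatives = [r for r in rows if not label_of(r)]
    # The shorter list is the minority class; keep all of it.
    minority, majority = sorted([positives, negatives], key=len)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


# Made-up level records: 10 replayed levels and 40 levels without replay.
levels = [{"level": i, "replayed": i % 5 == 0} for i in range(50)]
balanced = undersample_majority(levels, lambda r: r["replayed"])
```

After balancing, a classifier that always predicts one class scores 50%, which gives a baseline against which the reported tree accuracies can be read.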


Table 5 reports the trees and the importance of the features. 
We found that a student’s performance on a particular level 
influenced whether replay happened during/after the level’s 
pass attempts. For example, a student was more likely to 
replay a different level under the same objective (DLSO) 
if they took more than two attempts to pass the current 
level. This result is related to the previous result in Table 4,
showing that, at the student level, those with lower gameplay
performance were more likely to replay another level
under the same objective.


On the other hand, the objective to which a level belongs
influences whether or not a level would be ERed. We built
trees to predict if a level is replayed following the same level
(same condition as the last row in Table 5, N=1,776), the
same objective but a different level (N=12,616), or a
different objective (N=31,852). For all three conditions, the
trees contain only a single node, objective, with accuracies
of 55.2%, 62.0%, and 66.9% respectively. This ER decision
could have been influenced by either the content or timing
of the objectives. In our tree node, we noticed that many
objectives with a higher chance of ER occurred earlier in
the curriculum; this could be because students had more
time in which these objectives were available for ER. Our
tree model also had only 55.2% accuracy when predicting
whether a level would be ERed following its own pass
attempts. One explanation is that we do not have puzzle
granularity data on how many lives a student actually lost.
From prior literature [4] [7], students may replay the same
level following its pass attempts to get a better score, which
means losing fewer lives (making fewer errors) at a level. As
shown in Table 4, Group SL students who performed most
of their ERs after the same level also achieved the highest
ER performance.


Table 5: Decision Trees to Predict Levels whose Pass
Attempts were Interrupted or Followed by ER

Condition: interrupted/followed by ER from a different
level in the same objective (N=8,094)
Tree (77.8% accuracy):
  # pass attempts < 2.5: No
  # pass attempts > 2.5: Yes

Condition: interrupted/followed by ER from a different
objective (N=12,506)
Tree (78.7% accuracy):
  1st attempt performance > 0.94
    objective group A: No
    objective group B: Yes
  1st attempt performance < 0.94
    objective group A
      # pass attempts < 6.5: No
      # pass attempts > 6.5: Yes
    objective group B: Yes

Condition: interrupted/followed by ER on the same level
(N=1,766)
Tree (55.2% accuracy):
  objective group A: No
  objective group B: Yes

Note. Trees are presented in text format. For example, the first
tree shows that if a student passed a level with fewer than 2.5 pass
attempts, the tree predicts this student will not replay another
level during/after this level.
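

The second tree in Table 5 (ER from a different objective) can be read as a set of nested rules, sketched below. The function name is hypothetical, "objective group A/B" refers to the tree's own split of objectives (not the student Groups A and B), and the handling of a first-attempt performance of exactly 0.94 is an assumption, since the table does not say which branch it takes.

```python
def predicts_er_from_different_objective(first_attempt_perf, n_pass_attempts,
                                         objective_group):
    """Mirror the second tree of Table 5 as nested rules (illustrative only)."""
    if objective_group not in ("A", "B"):
        raise ValueError("objective_group must be 'A' or 'B'")
    if first_attempt_perf > 0.94:
        # High first-attempt performance: only objective group B predicts ER.
        return objective_group == "B"
    # Assumption: a performance of exactly 0.94 falls into this lower branch.
    if objective_group == "B":
        return True
    # Objective group A: ER is predicted only after many pass attempts.
    return n_pass_attempts > 6.5


# Made-up cases covering each leaf of the tree.
case_high_a = predicts_er_from_different_objective(0.99, 1, "A")
case_high_b = predicts_er_from_different_objective(0.99, 1, "B")
case_low_a_few = predicts_er_from_different_objective(0.50, 3, "A")
case_low_a_many = predicts_er_from_different_objective(0.50, 7, "A")
```

Written this way, the tree shows that objective group B always predicts an interruption, while group A does so only for levels with low first-attempt performance and more than six pass attempts.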


4.3 Is Elective Replay Associated with Gains?


In this section we will address our second research question. 
As part of our analysis we considered three gain scores: ac- 
curacy gain, confidence gain, and math gain. The first two 
were measured by in-game pre- and post-tests. Recall that 
both before and after a student attempts an objective, ST 
Math logs the students’ correctness and confidence scores 
on each question on the pre- and post-tests. We averaged 
these scores across the pre- and post-test questions to com- 
pute the first two gain scores. These were assessed at the 
objective granularity. Math gain was calculated based upon 
the difference between the students’ state standardized math 
test scores in years 2012 and 2013. This was assessed at the
student granularity.


11.8% of the students were excluded from the math gain
analysis due to missing state math test records. These
excluded students performed statistically significantly worse in
the game as measured by the three pass attempt features;
this implies that we excluded weaker students. 8.5% of the
objective observations were excluded from the accuracy and
confidence gain analysis due to missing pre- or post-tests.
These excluded observations were not statistically
significantly different from the rest as measured by pass attempt
features. The accuracy and confidence gains were
significantly correlated (r=0.37, p<0.001), but these two gains
were not strongly correlated with math gain scores at the
student granularity (r<0.1, p<0.001). Table 6 reports the
percentage of data points that gained, were dropped (mainly to
avoid ceiling effects in this data), and did not gain for each
type of gain based on the Marx and Cummings Normalization
method [11].


Figure 2: Decision Tree to Predict Whether a Student
will Gain in State Standardized Math Test
(Split nodes: 2012 Math State Test >= 347; Average Post-Test
Accuracy < 0.7093; 2012 Math State Test >= 474; Average Pass
Attempts Performance < 0.8857. Legible leaves: Not Gained
(336/499), Not Gained (241/359), Gained (725/1207).)


Table 6: % Observations with Gains, No Gains, and
Percentage Dropped for the Three Gains

Gain Types             ER?    Gained  Dropped  No Gain
Accuracy (N=75,083)    ER     48.10%  8.60%    37.90%
                       No ER  43.70%  6.10%    36.60%
Confidence (N=75,083)  ER     28.30%  42.60%   23.10%
                       No ER  26.40%  37.40%   22.70%
Math Test (N=4,827)    ER     41.60%  0.40%    46.90%
                       No ER  40.80%  0.50%    45.70%

Note. 1) Observations in the ’Dropped’ column (pre- and
post-tests were both 0 or 1) were excluded from analysis. 2) Accuracy
and Confidence Gains were measured at objective granularity;
Math gain was measured at student granularity. 3) ER and no
ER were collapsed across level.
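

The gain normalization can be sketched as below. The text cites the Marx and Cummings method [11] without spelling out the formula, so this sketch assumes the commonly cited normalized-change definition: gains are scaled by the room left to improve, losses by the room left to fall, and observations at floor or ceiling on both tests are dropped. The function name and the `max_score` parameter are illustrative.

```python
def normalized_change(pre, post, max_score=1.0):
    """Normalized change between a pre- and post-test score.

    Returns None when the observation would be dropped (both scores at 0
    or both at the maximum), matching the 'Dropped' column of Table 6.
    """
    if pre == post and pre in (0.0, max_score):
        return None
    if post > pre:
        # Gain as a share of the maximum possible gain.
        return (post - pre) / (max_score - pre)
    if post < pre:
        # Loss as a share of the maximum possible loss.
        return (post - pre) / pre
    return 0.0


# Made-up scores on a 0-1 scale.
gain = normalized_change(0.5, 0.75)
loss = normalized_change(0.8, 0.4)
dropped = normalized_change(1.0, 1.0)
```

For the state math test, the same function could be called with the test's own maximum score in place of 1.0.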


We first constructed decision trees to partition our data to 
see which factors influence gains, using the method described 
in the prior section. No sampling was necessary because the 
groups had similar sizes. We used pass attempt features, 
ER features, pre-test results, and demographics. For stu- 
dent granularity, we also added the percentage of required 
objectives attempted by the student. 


At the objective granularity, we found that pre-test accuracy 
and confidence were the only selected nodes that predicted 
accuracy (70.0% accuracy) and confidence gain (74.1% accu- 
racy). Students with a pre-test accuracy of < 0.71 (at least 2 
questions wrong out of 5-10) had a 64.7% chance of positive 
accuracy gain in the same objective, while the remainder of 
the students had only a 25.9% chance. Students with a
pre-test confidence of < 0.95 (0.95 indicating confidence on
almost all questions) had a 62.5% chance of positive
confidence gain
in the same objective. It could be that these in-game tests 
were too easy, as 18.9% of pretests achieved full scores in 
accuracy and 54.5% achieved full scores in confidence. 


Our decision tree for the student granularity is shown in
Figure 2, with a cross-validated accuracy of 57.8%. Students
who started with a medium level of math ability (2012
state test math scores < 474 and > 347) improved their
scores when they performed well in ST Math (average pass
attempts performance > 0.8857). This shows that the gameplay
data in ST Math has predictive power for assessment
outside of the game. However, for all three gain scores, the
ER features were not selected for inclusion in the decision
tree, nor was any correlation found with the students’ gains.


Table 7: Mann-Whitney U Tests Comparing Gains
between ER Pattern Student Groups.

Group (# students) | Math (max=600) | Accuracy (max=1) | Confidence (max=1)
Base: No ER (N=1938) | M=31.5, SD=146.6 | M=0.31, SD=0.25 | M=0.33, SD=0.38
ER (N=2889) | M=27.3, SD=139.7 | M=0.30, SD=0.25 | M=0.32, SD=0.37
Group A (N=1114) | M=53.4, SD=167.9 | M=0.36 [marker illegible], SD=0.24 | +M=0.38, SD=0.36
Group B (N=1464) | +M=6.7, SD=[illegible] | M=0.24, SD=[illegible] | M=0.26, SD=0.37
Group SL (N=173) | M=46.2, SD=161.2 | M=0.31, SD=0.28 | M=0.31, SD=0.37
Group DLSO (N=983) | M=21.4, SD=[illegible] | *M=0.25, SD=0.26 | *M=0.27, SD=0.37
Group DO (N=1399) | M=32.3, SD=150.6 | M=0.32, SD=0.23 | M=0.34, SD=0.36

Note. Green and red indicate values statistically significantly higher
and lower than the base class, with *p < .001, +p < .01.


Finally, we investigated how ER patterns relate to gains. 
Table 7 reports the result from separating students into 6 
groups based on ER patterns and conducting Mann-Whitney 
U tests with Benjamini-Hochberg correction (as in the previ- 
ous section). Moreover, although decision trees constructed 
from the complete dataset show that low pre-test results 
led to more gains, some ER pattern groups showed opposite 
trends. For example, Group B, who primarily ERed before
passing the current level, started with lower pre-test scores,
did worse in the game, and had statistically significantly
lower gains in all three gain measures. The same
applies to Group DLSO. These two groups of students also 
had the lowest ER performance. 


On the other hand, the Base group and Group A (who
mostly ERed after passing the current level) started with
pre-test accuracy and confidence scores that are not
significantly different (Table 4), but Group A did significantly
better in game, and had statistically significantly higher
gains in accuracy and confidence. Because the mean
pre-test score for the Base and A groups is approximately
0.6, these students were reasonably familiar with the
objective before they began playing it. The difference in
accuracy and confidence gains suggests that ER after students
successfully pass a level helped students learn, or implied
better learning in the previous gameplay.


5. DISCUSSION AND CONCLUSIONS 


This work presents a significant extension of prior studies of
replay, which have typically taken place over a short period of
time and have assessed replay via questionnaires about
intentions, not observed behaviors [14, 5]. This work analyzed logged
student-initiated elective replay from a sample of 4,827
3rd-5th graders during school year 2012-2013 in ST Math in
a natural educational setting. We sought to answer three
research questions: Q1: What are the characteristics of
students who electively replay? Q2: What gets replayed, and
under what circumstances? And Q3: Is elective replay
associated with improvements in students’ accuracy on math
objectives, confidence, and general math ability?


We concluded that, with over half of students electively
replaying at least one level, ER is a common behavior in ST
Math. Moreover, examining elective replay can enhance our
understanding about how students play and the character- 
istics of successful play. For example, we found that stu- 
dents who did poorly on the current level were more likely 
to electively replay a different level during/after the level’s 
pass attempts. We also found that students who generally 
engaged in elective replay before passing the current level 
(Group B) started with lower pre-test scores, did worse dur- 
ing gameplay, and had the lowest objective-level accuracy 
and confidence gain and math gains. One explanation for 
this result is that weaker students used ER as a work avoid- 
ance tactic, as found in Mostow et al. [12], and that in- 
stances of ER stand in for lower motivation or engagement 
for the objective topic, ST Math, or mathematics overall. 


On the other hand, compared to students who didn’t ER, 
students who mostly electively replayed after passing the 
current level (Group A) started with pre-test scores that 
were not significantly different, did better in the game, and 
had higher learning and confidence gains. One reason could 
be that these students electively replayed for a better score, 
as we also found that students who mostly replayed the 
same level immediately after passing it (Group SL) had the 
highest ER performance. This association is especially true 
among achiever-type players [3] that prefer to gain concrete 
measurements of success. Because losing fewer lives in ST 
Math requires better mastery of the math content, ER may 
have helped these students learn. Another explanation is 
that these students’ ERs could imply better learning during 
prior gameplay, as Table 4 also shows that Group A students 
had better pass attempt performance. Possibly, successful 
prior performance motivated these students to electively re- 
play more of the game. Moreover, because successful prior 
performance feeds self-efficacy [2, 13], confidence gains in 
Group A students, who chose more ER, may be linked to 
electively replaying levels they have already mastered. 


From the application perspective, as expected from this com- 
plex environment, our effect-sizes are too small to claim ER 
itself as a powerful intervention for learning. Instead, our 
findings suggest the potential of using ER patterns to iden- 
tify weaker students and their struggling moments for inter- 
vention. For example, students with Group B ER patterns 
started weaker, did poorly in the game, and had lower gains 
in learning, confidence, and math state test scores. It may 
be the case that Group B ER (before passing a level) is a 
signal that students are struggling in current content and 
are in need of a mental break [17] or help. If this is the case,
it would be beneficial upon detecting these ER patterns for
ST Math to alert teachers or to provide interventions, such
as suggesting that the student take a break or providing
supplemental resources to further explain the math concepts
from the pass attempts interrupted by ER. Our results also
suggest avenues for experimental studies that design a more
effective ER experience, such as preventing work-avoidance
in ER, for example by changing the number of lives students
have at each replay, or by constraining the problems offered
each time they are replayed to be isomorphic but not
identical.


This work has several limitations. First, the in-game pre-
and post-tests may be too easy for students, as 18.9% of pretests
achieved a full score in accuracy, and 54.5% achieved a full
score in confidence. The high percentage of students with
non-positive learning and accuracy gain could also be caused
by students’ slipping or guessing in multiple-choice questions
(e.g., 1 incorrect answer reduces accuracy by 14%-20%). The
accuracy of the pre- and post-test questions for assessing
knowledge might be improved by using short answer
questions. The second limitation is that we did not have puzzle
granularity data on how many lives a student actually lost
or the types of errors they made. Third, the grouping of
students based on the majority of elective replay assumes that
elective replay is a habitual and consistent behavior. Future
research should investigate other groupings, as well as
examine whether there were changes in how students used
replay, and what caused the changes. Fourth, future work
may also include creating quantified features to compare the
content and game features across objectives so we may better
understand how the game’s content influences students’
decisions to engage in elective replay.


In summary, this work adds new insights to our understand- 
ing of elective replay in educational games. Our work reveals 
differential associations between elective replay and perfor- 
mance when replay is categorized by the timing in relation to 
the student’s current learning objectives and gameplay. Our 
work suggests that low-performing students did not benefit 
from ER; high-performing students both chose ER at better 
times and their ERs were associated with benefits from ei- 
ther ER or previous gameplay, which supports the results of 
prior self-regulation research by Aleven et al. [1]. This work
presents prospects for both examining more detailed
characteristics of replay and utilizing experimental manipulations.


ABSTRACT


There is a critical need to develop new educational
technology applications that analyze the data collected by
universities to ensure that students graduate in a timely fashion
(4 to 6 years) and are well prepared for jobs in their
respective fields of study. In this paper, we present a novel
approach for analyzing historical educational records from 
a large, public university to perform next-term grade pre- 
diction; i.e., to estimate the grades that a student will get 
in a course that he/she will enroll in the next term. Accu- 
rate next-term grade prediction holds the promise for bet- 
ter student degree planning, personalized advising and au- 
tomated interventions to ensure that students stay on track 
in their chosen degree program and graduate on time. We 
present a factorization-based approach called Matrix Factor- 
ization with Temporal Course-wise Influence that incorpo- 
rates course-wise influence effects and temporal effects for 
grade prediction. In this model, students and courses are 
represented in a latent “knowledge” space. The grade of a 
student on a course is modeled as the similarity of their la- 
tent representation in the “knowledge” space. Course-wise 
influence is considered as an additional factor in the grade 
prediction. Our experimental results show that the proposed
method outperforms several baseline approaches and infers
meaningful patterns between pairs of courses within
academic programs.


Keywords 
next-term grade prediction, course-wise influence, temporal 
effect, latent factor 


1. INTRODUCTION


Data analytics is at the forefront of innovation in several 
of today’s popular Educational Technologies (EdTech) [17]. 
Currently, one of the grand challenges facing higher educa- 
tion is the problem of student retention and graduation [19]. 
There is a critical need to develop new EdTech applications
that analyze the data collected by universities to ensure that
students graduate in a timely fashion (4 to 6 years) and
are well prepared for jobs in their respective fields of study.
To this end, several universities deploy a suite of software
and tools. For example, degree planners assist students
in deciding their majors or fields of study, choosing the se- 
quence of courses within their chosen major and providing 
advice for achieving career and learning objectives. Early 
warning systems [27] inform advisors/students of progress, 
and additionally provide cues for intervention when students 
are at the risk of failing one or more courses and dropping 
out of their program of study. In this work, we focus on the 
problem of next-term grade prediction where the goal is to 
predict the grade that a student is expected to obtain in a 
course that he/she may enroll in the next term (future). 


In the past few years, several algorithms have been developed
to analyze educational data, including Matrix Factorization
(MF) algorithms inspired by recommender system
research. MF methods decompose the student-course (or
student-task) grade matrix into two low-rank matrices, and 
then the prediction of the grade for a student on an untaken 
course is calculated as the product of the corresponding vec- 
tors in the two decomposed matrices [22, 11]. Traditional 
MF algorithms have shown a strong ability to deal with 
sparse datasets [14] and their extensions have incorporated 
temporal and dynamic information [12]. In our setting, we 
consider that a student’s knowledge is continuously being 
enriched while taking a sequence of courses; and it is im- 
portant to incorporate this dynamic influence of sequential 
courses within our models. Therefore, we present a novel
approach referred to as the Matrix Factorization with Temporal
Course-wise Influence (MFTCI) model to predict next-term
student grades. MFTCI considers that a student’s grade on
a certain course is determined by two components: (i) the
student’s competence with respect to each course’s topics,
content, requirements, etc., and (ii) the student’s previous
performance over other courses. We performed a comprehensive
set of experiments on various datasets. The experimental
results show that the proposed method outperforms
several state-of-the-art methods. The main contributions of
our work in this paper are as follows:


1. We model and incorporate temporal course-wise influence
in addition to matrix factorization for grade
prediction. Our experimental results demonstrate significant
improvement from course-wise influence.


2. Our model successfully captures meaningful course- 
wise influences which correlate to the course content. 


3. The learned influences between pairs of courses help 
in understanding pre-requisite structures within pro- 
grams and tuning academic program chains. 
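

The basic matrix-factorization prediction described in the introduction (a grade as the product of a student's and a course's latent vectors) can be sketched as follows. This shows only the plain MF component; the temporal course-wise influence term that MFTCI adds is not included, and the function name and latent vectors are made up for illustration.

```python
def predict_grade(student_vec, course_vec):
    """Plain MF prediction: dot product of student and course latent vectors."""
    if len(student_vec) != len(course_vec):
        raise ValueError("latent vectors must have the same dimension")
    return sum(p * q for p, q in zip(student_vec, course_vec))


# Hypothetical 3-dimensional latent "knowledge" vectors.
student = [1.0, 0.5, 2.0]
course = [1.5, 1.0, 0.5]
predicted = predict_grade(student, course)
```

In a trained model the two vectors would be learned from the observed grade matrix rather than chosen by hand.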


2. RELATED WORK 


Over the past few years, several methods have been developed
to model student behavior and academic performance [2, 9],
and they have led to improved learning outcomes [21]. Methods
influenced by Recommender System
(RS) research [1], including Collaborative Filtering (CF) [18]
and Matrix Factorization [13], have attracted increasing
attention in educational mining applications which relate to
student grade prediction [32] and in-class assessment
prediction [8]. Sweeney et al. [31, 30] performed an extensive
study of several recommender system approaches including
SVD, SVD-kNN and Factorization Machine (FM) to
predict next-term grade performance. Inspired by
content-based recommendation [20] approaches, Polyzou et al. [23]
addressed the future course grade prediction problem with
three approaches: course-specific regression, student-specific
regression and course-specific matrix factorization.
Moreover, neighborhood-based CF approaches [25, 4, 6] predict
grades based on student similarities, i.e., they first
identify similar students and use their grades to estimate the
grades of the students with similar profiles.


In order to capture changing user dynamics over time
in RS, various dynamic models have been developed. Many
such models are based on Matrix Factorization and state
space models. Sun et al. [28, 29] model user preference
change using a state space model on latent user factors, and
estimate user factors over time using noncausal Kalman
filters. Similarly, Chua et al. [5] apply Linear Dynamical
Systems (LDS) on Non-negative Matrix Factorization (NMF)
to model user dynamics. Ju et al. [12] encapsulate the
temporal relationships within a Non-negative matrix
formulation. Zhang et al. [34] learn an explicit transition
matrix over the latent factors for each user, and estimate
the user and item latent factors and the transition
matrices within a Bayesian framework. Other popular methods
for dynamic modeling include time-weighted similarity
decay [7], tensor factorization [33] and point processes [16].
The method proposed in this paper tackles the challenges of
next-term grade prediction, which relates to the evolution
of student knowledge over a sequence of courses. Our
key contribution involves how we incorporate the temporal
course-wise relationships within an MF approach.
Additionally, the proposed approach learns pairwise relationships
between courses that can help in understanding pre-requisite
structures within programs and tuning academic program
chains.


3. PRELIMINARIES 


3.1 Problem Statement and Notations 

Formally, student-course grades will be represented by a series
of matrices $\{G_1, G_2, \ldots, G_T\}$ for $T$ terms. Each row
of $G_t$ represents a student, each column of $G_t$ represents a
course, and each value in $G_t$, denoted as $g^t_{s,c}$, represents a
grade that student $s$ got on course $c$ in term $t$ ($g^t_{s,c} \in (0, 4]$;
$g^t_{s,c} = 0$ indicates that student $s$ did not take the course $c$ in
term $t$. We add a small value to a failing grade to distinguish a
0 score from such a situation.). Student-course grades up to
the $t$-th term will be represented by $G^t = \sum_{i=1}^{t} G_i$ with size
$n \times m$, where $n$ is the number of students and $m$ is the
number of courses. Given the database of (student, course,
grade) triples up to term $(T - 1)$ (i.e., $G^{T-1}$), the next-term grade
prediction problem is to predict grades for each student on
courses they might enroll in the next term $T$. To simplify
the notation, if not specifically stated in this paper, we will
use $g_{s,c}$ to denote $g^t_{s,c}$. Our testing set is then the (student,
course, grade) triples in the $T$-th term, represented by matrix
$G_T$. Rows from the grade matrices representing a student $s$
will simply be represented as $G(s,:)$, and the specific courses
that student has a grade for in this row can be given by
$c \in G(s,:)$.
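The notation above can be illustrated with a small sketch that stores each per-term grade matrix sparsely and derives the training data (all terms before $T$) and the test set (term $T$). The record layout, the helper names `build_term_matrices` and `split_train_test`, the 0.01 offset for failing grades, and the sample records are assumptions made for this example only.

```python
from collections import defaultdict

def build_term_matrices(records):
    """Group (student, course, term, grade) records into sparse per-term
    matrices G_t, stored as {term: {(student, course): grade}}."""
    G = defaultdict(dict)
    for student, course, term, grade in records:
        # A value of 0 is reserved for "course not taken", so a failing
        # grade is nudged slightly above 0 (the offset is an assumption).
        G[term][(student, course)] = grade if grade > 0 else 0.01
    return dict(G)

def split_train_test(G, T):
    """Return (grades before term T, grades in term T)."""
    train = {}
    for t in sorted(G):
        if t < T:
            # In this sketch a retake overwrites the earlier grade.
            train.update(G[t])
    return train, G.get(T, {})

# Hypothetical records: (student, course, term index, grade on a 0-4 scale).
records = [
    ("s1", "CS101", 1, 3.7),
    ("s1", "CS201", 2, 3.0),
    ("s2", "CS101", 1, 0.0),   # failing grade, stored as 0.01
    ("s2", "CS101", 2, 2.3),   # retake in a later term
    ("s1", "CS301", 3, 3.3),
]
G = build_term_matrices(records)
train, test = split_train_test(G, T=3)
```

Storing only observed (student, course) pairs keeps the representation sparse, which matches the fact that most students take only a few of the available courses.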


In this paper, all vectors (e.g., $\mathbf{u}_s$ and $\mathbf{v}_c$) are represented
by bold lower-case letters and all matrices (e.g., $A$) are represented
by upper-case letters. Column vectors are represented
by having the transpose superscript $^\top$; otherwise, by default,
they are row vectors. A predicted/approximated value
is denoted by having a tilde ($\tilde{\ }$) head.


4. METHODS 


4.1 MF with Temporal Course-wise Influence 
We consider student $s$'s grade on a certain course $c$, denoted
as $g_{s,c}$, as determined by two factors. The first factor
is student $s$'s competence with respect to course $c$'s
topics, content and requirements. This is modeled through
a latent factor model, in which $s$'s competence is captured
using a size-$k$ latent factor $\mathbf{u}_s$, and $c$'s topics and contents are
captured using a size-$k$ latent factor $\mathbf{v}_c$ in the same latent
space as $\mathbf{u}_s$. Then the competence of $s$ over $c$ is modeled
by the “similarity” between $\mathbf{u}_s$ and $\mathbf{v}_c$ via their dot product
(i.e., $\mathbf{u}_s \mathbf{v}_c^\top$).


The second factor is the previous performance of student s 
over other courses. We hypothesize that if course c’ has a 
positive influence on course c, and student s achieved a high 
grade on c’, then s tends to have a high grade on c. Under
this hypothesis, we model this second factor as the product
of the student’s grade on a previous “related” course and the
influence of that course on c, where the pairwise course
influences are learned in our formulation. Note that we consider this pairwise
course influence as time independent, i.e., the influence of 
one course over another does not change over time. How- 
ever, the impact from previous performance/grades can be 
modeled using a decay function over time. Taking these two 
factors, the estimated grade is given as follows: 


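$$\tilde{g}_{s,c} = \mathbf{u}_s \mathbf{v}_c^\top
+ \underbrace{e^{-\alpha} \frac{\sum_{c' \in G_{T-1}(s,:)} A(c',c)\, g_{s,c'}}{|G_{T-1}(s,:)|}}_{\Delta(T-1)}
+ \underbrace{e^{-2\alpha} \frac{\sum_{c' \in G_{T-2}(s,:)} A(c',c)\, g_{s,c'}}{|G_{T-2}(s,:)|}}_{\Delta(T-2)} \quad (1)$$

in which $A(c',c)$ is the influence of $c'$ on $c$, $G_{T-1}(s,:)$/$G_{T-2}(s,:)$
is the subset of courses out of all courses that $s$ has taken in
the first/second previous terms, $|G_{T-1}(s,:)|$/$|G_{T-2}(s,:)|$ is
the number of such taken courses, and $e^{-\alpha}$/$e^{-2\alpha}$ denote the
time-decay factors. In Equation 1, we consider the previous
two terms. More previous terms can be included with even
stronger time-decay factors. Given the grade estimation as
in Equation 1, we formulate the grade prediction problem
for term $T$ as the following optimization problem,

$$\min_{U,V,A} \; \frac{1}{2} \sum_{s,c} (g_{s,c} - \tilde{g}_{s,c})^2 + \frac{\gamma}{2} \left( \|U\|_F^2 + \|V\|_F^2 \right) + \tau \|A\|_* + \lambda \|A\|_{\ell_1} \quad \text{s.t. } A \ge 0$$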


where $U$ and $V$ are the latent non-negative student factors
and course factors, respectively; $\|A\|_*$ is the nuclear norm
of $A$, which will induce an $A$ of low rank; and $\|A\|_{\ell_1}$ is the
$\ell_1$ norm of $A$, which will introduce sparsity in $A$. In addition,
the non-negativity constraint on $A$ is to enforce only
positive influence across courses.
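A minimal sketch of how the grade estimate of Equation 1 could be computed for one student-course pair is given below. It assumes the time-decay factors take the form $e^{-\alpha}$ and $e^{-2\alpha}$; the function name `estimate_grade`, the dictionary layout of `A`, the default `alpha` and the toy numbers are illustrative assumptions rather than part of the original method description.

```python
import math

def estimate_grade(u_s, v_c, A, course, prev1, prev2, alpha=1.0):
    """Estimate a grade as a latent dot product plus decayed course influences.

    u_s, v_c : latent factors (lists of floats) of the student and the course
    A        : dict mapping (previous_course, course) -> non-negative influence
    prev1    : {course: grade} earned by the student in term T-1
    prev2    : {course: grade} earned by the student in term T-2
    alpha    : decay rate (assumed hyperparameter)
    """
    estimate = sum(u * v for u, v in zip(u_s, v_c))
    for lag, taken in ((1, prev1), (2, prev2)):
        if not taken:
            continue  # no courses in that term, so no influence term
        influence = sum(A.get((c_prev, course), 0.0) * g
                        for c_prev, g in taken.items())
        # Average over the courses taken in that term, then apply the decay.
        estimate += math.exp(-lag * alpha) * influence / len(taken)
    return estimate

# Toy example with k = 2 latent dimensions and one learned influence.
A = {("CS101", "CS201"): 0.5}
g_hat = estimate_grade([1.0, 0.5], [2.0, 1.0], A, "CS201",
                       prev1={"CS101": 4.0, "MATH1": 3.0}, prev2={})
```

In the toy call the latent part contributes 2.5 and the single non-zero influence contributes $e^{-1} \cdot (0.5 \cdot 4.0)/2$, which shows how a high grade in an influential earlier course raises the estimate.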


4.1.1 Optimization Algorithm of MFTCI 
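We apply the ADMM [3] technique for Equation 2 by reformulating
the optimization problem as follows,

$$\begin{aligned}
\min_{U,V,A,U_1,U_2,Z_1,Z_2} \;& \frac{1}{2} \sum_{s,c} (g_{s,c} - \tilde{g}_{s,c})^2 + \frac{\gamma}{2} \left( \|U\|_F^2 + \|V\|_F^2 \right) + \tau \|Z_1\|_* + \lambda \|Z_2\|_{\ell_1} \\
& + \frac{\rho}{2} \left( \|A - Z_1\|_F^2 + \|A - Z_2\|_F^2 \right) + \rho\, \mathrm{tr}(U_1^\top (A - Z_1)) + \rho\, \mathrm{tr}(U_2^\top (A - Z_2)) \\
\text{s.t. } \;& A \ge 0
\end{aligned}$$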


where $Z_1$ and $Z_2$ are two auxiliary variables, and $U_1$ and $U_2$
are two dual variables. All the variables are solved via an
alternating approach as follows.


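Step 1: Update U and V. Fixing all the other variables and
solving for $U$ and $V$, the problem becomes a classical matrix
factorization problem:

$$\min_{U,V} \; \frac{1}{2} \sum_{s,c} (f_{s,c} - \mathbf{u}_s \mathbf{v}_c^\top)^2 + \frac{\gamma}{2} \left( \sum_s \|\mathbf{u}_s\|_2^2 + \sum_c \|\mathbf{v}_c\|_2^2 \right) \quad (2)$$

where $f_{s,c} = g_{s,c} - \Delta(T-1) - \Delta(T-2)$, with $\Delta(T-1)$ and
$\Delta(T-2)$ denoting the two temporal influence terms in Equation 1. The
matrix factorization problem can be solved using alternating
minimization.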
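One way to carry out such an alternating update is sketched below with plain projected gradient steps: the student factors are updated with the course factors fixed, and then the reverse. The function names, the step size, the regularization weight `gamma`, the number of iterations and the toy residuals are all assumptions for illustration; the original work does not specify these details here.

```python
def squared_error(F, U, V):
    """Sum of squared residuals over the observed (student, course) pairs."""
    return sum((f - sum(a * b for a, b in zip(U[s], V[c]))) ** 2
               for (s, c), f in F.items())

def gradient_pass(F, U, V, update_students, gamma, lr):
    """One projected gradient step on U (V fixed) or on V (U fixed)."""
    target = U if update_students else V
    # Gradient of the regularizer gamma/2 * ||x||^2 is gamma * x.
    grads = {key: [gamma * x for x in vec] for key, vec in target.items()}
    for (s, c), f in F.items():
        err = f - sum(a * b for a, b in zip(U[s], V[c]))
        key, other = (s, V[c]) if update_students else (c, U[s])
        for d, x in enumerate(other):
            grads[key][d] -= err * x
    for key, vec in target.items():
        # Projection onto [0, inf) keeps the latent factors non-negative.
        target[key] = [max(x - lr * g, 0.0) for x, g in zip(vec, grads[key])]

def fit_factors(F, U, V, gamma=0.1, lr=0.05, n_iter=200):
    """Alternate between student-factor and course-factor updates."""
    for _ in range(n_iter):
        gradient_pass(F, U, V, True, gamma, lr)
        gradient_pass(F, U, V, False, gamma, lr)
    return U, V

# Toy residuals f_{s,c} and k = 2 latent dimensions (hypothetical values).
F = {("s1", "c1"): 3.0, ("s1", "c2"): 2.0, ("s2", "c1"): 4.0}
U = {"s1": [0.5, 0.5], "s2": [0.5, 0.5]}
V = {"c1": [0.5, 0.5], "c2": [0.5, 0.5]}
before = squared_error(F, U, V)
fit_factors(F, U, V)
after = squared_error(F, U, V)
```

Only observed pairs enter the loss, so the cost of each pass grows with the number of student-course dyads rather than with the full $n \times m$ matrix.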


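Step 2: Update A. Fixing all the other variables and solving
for $A$, the problem becomes

$$\begin{aligned}
\min_{A} \;& \frac{1}{2} \sum_{s,c} (g_{s,c} - \tilde{g}_{s,c})^2 + \frac{\rho}{2} \left( \|A - Z_1\|_F^2 + \|A - Z_2\|_F^2 \right) \\
& + \rho\, \mathrm{tr}(U_1^\top (A - Z_1)) + \rho\, \mathrm{tr}(U_2^\top (A - Z_2)) \\
\text{s.t. } \;& A \ge 0
\end{aligned}$$

Using gradient descent, the elements in $A$ can be updated
as follows.

$$\begin{aligned}
A(c_i,c_j) = A(c_i,c_j) - lr \times \Big[ & \rho (A(c_i,c_j) - Z_1(c_i,c_j)) + \rho (A(c_i,c_j) - Z_2(c_i,c_j)) + \rho U_1(c_i,c_j) + \rho U_2(c_i,c_j) \\
& - \sum_{s} (g_{s,c_j} - \tilde{g}_{s,c_j})\, e^{-\alpha} \frac{g_{s,c_i}}{|G_{T-1}(s,:)|} \quad (\text{if } c_i \text{ is taken in term } T-1) \\
& - \sum_{s} (g_{s,c_j} - \tilde{g}_{s,c_j})\, e^{-2\alpha} \frac{g_{s,c_i}}{|G_{T-2}(s,:)|} \quad (\text{if } c_i \text{ is taken in term } T-2) \Big]
\end{aligned} \quad (3)$$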


with projection onto $[0, +\infty)$, where $lr$ is the learning rate.
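The projection step can be sketched as follows: after a gradient step, every entry of $A$ that has become negative is clipped back to zero. The function name `projected_step`, the sparse dictionary layout and the numbers are assumptions for illustration; the gradient values are supplied directly rather than computed from data.

```python
def projected_step(A, grad, lr=0.01):
    """One projected gradient step: A <- max(A - lr * grad, 0), entry-wise.

    A, grad : {(c_i, c_j): value}; entries missing from `grad` keep their value.
    """
    return {pair: max(value - lr * grad.get(pair, 0.0), 0.0)
            for pair, value in A.items()}

# Hypothetical influences and gradients for two course pairs.
A = {("CS101", "CS201"): 0.30, ("CS101", "MATH1"): 0.01}
grad = {("CS101", "CS201"): -2.0, ("CS101", "MATH1"): 5.0}
A_new = projected_step(A, grad, lr=0.01)
```

In this toy case the first influence grows to 0.32, while the second would become negative and is therefore clipped to 0, which is how the non-negativity constraint removes influences the data do not support.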


Step 3: Update Z_1 and Z_2. For $Z_1$, the problem becomes

$$\min_{Z_1} \; \tau \|Z_1\|_* + \frac{\rho}{2} \|A - Z_1\|_F^2 + \rho\, \mathrm{tr}(U_1^\top (A - Z_1)) \quad (4)$$

The closed-form solution of this problem is

$$Z_1 = S_{\tau/\rho}(A + U_1) \quad (5)$$

where $S_\alpha(X)$ is a soft-thresholding function that shrinks the
singular values of $X$ with a threshold $\alpha$, that is,

$$S_\alpha(X) = U\, \mathrm{diag}((\Sigma - \alpha)_+)\, V^\top \quad (6)$$

where $X = U \Sigma V^\top$ is the singular value decomposition of $X$,
and

$$(x)_+ = \max(x, 0). \quad (7)$$

For $Z_2$, the problem becomes

$$\min_{Z_2} \; \lambda \|Z_2\|_{\ell_1} + \frac{\rho}{2} \|A - Z_2\|_F^2 + \rho\, \mathrm{tr}(U_2^\top (A - Z_2)) \quad (8)$$

The closed-form solution is

$$Z_2 = E_{\lambda/\rho}(A + U_2) \quad (9)$$

where $E_\alpha(X)$ is a soft-thresholding function that shrinks the
values in $X$ with a threshold $\alpha$, that is,

$$E_\alpha(X) = (X - \alpha)_+ \quad (10)$$

where $(\cdot)_+$ is defined as in Equation 7.
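The two shrinkage operators can be sketched in a few lines. The entry-wise operator follows Equation 10 directly; for the singular-value operator, only the shrinkage of the singular values is shown, under the assumption that the SVD itself is computed by a linear algebra library. All function names and numbers are illustrative.

```python
def positive_part(x):
    """(x)_+ = max(x, 0), as in Equation 7."""
    return max(x, 0.0)

def shrink_entries(X, alpha):
    """E_alpha(X): shrink every entry of a matrix (list of rows) by alpha."""
    return [[positive_part(x - alpha) for x in row] for row in X]

def shrink_singular_values(sigma, alpha):
    """Shrink a list of singular values by alpha. S_alpha(X) is then rebuilt
    as U diag(result) V^T from an SVD computed elsewhere."""
    return [positive_part(s - alpha) for s in sigma]

# Hypothetical inputs: a small matrix A + U_2 and three singular values.
Z2 = shrink_entries([[0.9, 0.05], [0.2, 0.0]], alpha=0.1)
kept = shrink_singular_values([2.5, 0.4, 0.05], alpha=0.5)
```

Small entries are set exactly to zero (sparsity), and small singular values are set exactly to zero (low rank), which is the effect the $\ell_1$ and nuclear norm regularizers are meant to have on $A$.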


Step 4: Update U_1 and U_2. $U_1$ and $U_2$ are updated based
on standard ADMM updates:

$$U_1 = U_1 + (A - Z_1); \quad U_2 = U_2 + (A - Z_2) \quad (11)$$


In addition, we provide a computational complexity analysis
of MFTCI in the Appendix.


5. EXPERIMENTS 
5.1 Dataset Description 


We evaluated our method on student grade records obtained 
from George Mason University (GMU) from Fall 2009 to 
Spring 2016. This period included data for 23,013 transfer 
students and 20,086 first-time freshmen (i.e., non-transfer
students, who begin their study at GMU) across 151 majors
enrolled in 4,654 courses.


Specifically, we extracted data for six large and diverse majors
for both non-transfer and transfer students. These majors
include: (i) Applied Information Technology (AIT), (ii)
Biology (BIOL), (iii) Civil, Environmental and Infrastructure
Engineering (CEIE), (iv) Computer Engineering (CPE),
(v) Computer Science (CS) and (vi) Psychology (PSYC).
Table 1 provides more information about these datasets.


Table 1: Dataset Descriptions

         Non-Transfer Students    Transfer Students
Major    #S      #S-C             #S      #S-C
AIT      535,739                  165     14,396
BIOL     990     33,527           833     22,691
CEIE     642     9,812            305     4,538
CPE      649     7,710            219     1,614
CS       818     18,376           464     7,967
PSYC     874     22,598           788     24,661
Total    T614 1,019 75,867

#S, #C and #S-C are the numbers of students, courses and student-course
pairs in educational records across the 6 majors from Fall 2009 to
Spring 2016, respectively.


Figure 1: Different Experimental Protocols
(training sets: Fall 2009 to Fall 2015, Fall 2009 to Spring 2015, and Fall 2009 to Fall 2014)


5.2 Experimental Protocol 

To assess the performance of our next-term grade prediction
models, we train our models on data up to term $T - 1$
and make predictions for term $T$. We evaluate our method
for three test terms, i.e., Spring 2016, Fall 2015 and Spring
2015. As an example, for evaluating predictions for term
Fall 2015, data from Fall 2009 to Spring 2015 is considered
as training data and data from Fall 2015 is testing data.
Figure 1 shows the three different train-test splits.
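The three train-test splits can be generated mechanically from an ordered list of terms, as sketched below. The helper names `term_sequence` and `split_for` are hypothetical, and ignoring summer terms is an assumption of this sketch rather than something stated in the protocol.

```python
def term_sequence(first_fall=2009, last_spring=2016):
    """Ordered Fall/Spring terms from Fall `first_fall` to Spring `last_spring`.
    Summer terms are ignored, which is an assumption of this sketch."""
    terms = []
    for year in range(first_fall, last_spring):
        terms.append(f"Fall {year}")
        terms.append(f"Spring {year + 1}")
    return terms

def split_for(test_term, terms):
    """Training terms are all terms strictly before the test term."""
    idx = terms.index(test_term)
    return terms[:idx], test_term

terms = term_sequence()
protocols = {t: split_for(t, terms)[0]
             for t in ("Spring 2016", "Fall 2015", "Spring 2015")}
```

For example, `protocols["Fall 2015"]` runs from Fall 2009 to Spring 2015, matching the example given in the text; no data from the test term or later ever enters the training set.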


5.3 Evaluation Metrics 

We use Root Mean Squared Error (RMSE) and Mean
Absolute Error (MAE) as evaluation metrics, which are
defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{(s,c) \in G_T} (g_{s,c} - \tilde{g}_{s,c})^2}{|G_T|}}, \qquad
\mathrm{MAE} = \frac{\sum_{(s,c) \in G_T} |g_{s,c} - \tilde{g}_{s,c}|}{|G_T|}$$

where $g_{s,c}$ and $\tilde{g}_{s,c}$ are the ground truth and predicted grade
for student $s$ on course $c$, and $G_T$ is the testing set of (student,
course, grade) triples in the $T$-th term. In the
next-term grade prediction problem, MAE is more intuitive
than RMSE, since MAE directly measures the average magnitude
of the errors, while RMSE penalizes large errors more heavily.
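The two metrics can be computed as in the following sketch, which represents the test set as a dictionary keyed by (student, course). The function names and the toy grades are assumptions for illustration.

```python
import math

def rmse(truth, pred):
    """Root mean squared error over the (student, course) pairs in `truth`."""
    return math.sqrt(sum((truth[p] - pred[p]) ** 2 for p in truth) / len(truth))

def mae(truth, pred):
    """Mean absolute error over the (student, course) pairs in `truth`."""
    return sum(abs(truth[p] - pred[p]) for p in truth) / len(truth)

# Hypothetical test grades and predictions with errors of 1, 0 and 2 points.
truth = {("s1", "c1"): 4.0, ("s1", "c2"): 3.0, ("s2", "c1"): 2.0}
pred = {("s1", "c1"): 3.0, ("s1", "c2"): 3.0, ("s2", "c1"): 4.0}
rmse_value = rmse(truth, pred)
mae_value = mae(truth, pred)
```

Here the MAE is exactly 1.0 while the RMSE is $\sqrt{5/3} \approx 1.29$; the gap comes from the single 2-point error, which illustrates how RMSE weights large errors more heavily.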


For our dataset, a student’s grade can be a letter grade (i.e.,
A, A-, ..., F). As done previously by Polyzou et al. [24], we
define a tick to denote the difference between two consecutive
letter grades (e.g., C+ vs C or C vs C-). To assess the
performance of our grade prediction method, we convert the
predicted grades into their closest letter grades and compute
the percentage of predicted grades with no error (or
0 ticks), within 1 tick and within 2 ticks, denoted by Pct$_0$,
Pct$_1$ and Pct$_2$, respectively. For the problem of course selection
and degree planning, courses predicted within 2 ticks
can be considered sufficiently correct. We name these metrics
Percentage of Tick Accuracy (PTA).
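A sketch of the tick-based metric follows. The letter-grade scale below is an assumed 4.0-point scale and may differ from the one used for the dataset; the function names and the sample grades are likewise hypothetical.

```python
# Assumed letter-grade scale, ordered from best to worst (illustrative only).
LETTERS = [("A", 4.0), ("A-", 3.67), ("B+", 3.33), ("B", 3.0), ("B-", 2.67),
           ("C+", 2.33), ("C", 2.0), ("C-", 1.67), ("D", 1.0), ("F", 0.0)]

def closest_letter_index(points):
    """Index (position on the scale) of the letter grade nearest to `points`."""
    return min(range(len(LETTERS)), key=lambda i: abs(LETTERS[i][1] - points))

def tick_accuracy(truth, pred, max_ticks):
    """Percentage of predictions within `max_ticks` letter-grade steps."""
    hits = sum(abs(closest_letter_index(t) - closest_letter_index(p)) <= max_ticks
               for t, p in zip(truth, pred))
    return 100.0 * hits / len(truth)

# Hypothetical true grades (A, B, C, B+) and numeric predictions.
truth = [4.0, 3.0, 2.0, 3.33]
pred = [3.9, 3.4, 1.2, 2.6]
pct0, pct1, pct2 = (tick_accuracy(truth, pred, k) for k in (0, 1, 2))
```

The four predictions map to A, B+, D and B-, which are 0, 1, 2 and 2 ticks away from the true letters, so the three percentages are nested: each wider tolerance includes all hits of the narrower one.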


5.4 Baseline Methods 


We compare the performance of our proposed method to the 
following baseline approaches. 


5.4.1 Matrix Factorization 

Matrix factorization is known to be successful in predicting
ratings accurately in recommender systems [26]. This
approach can be applied directly to the next-term grade prediction
problem by considering the student-course grade matrix as
a user-item rating matrix in recommender systems. Based
on the assumption that each course and student can be represented
in the same low-dimensional space, corresponding
to the knowledge space, two low-rank matrices containing
latent factors are learned to represent courses and students
[30]. Specifically, the grade a student $s$ will achieve on a
course $c$ is predicted as follows:

$$\tilde{g}_{s,c} = \mu + p_s + q_c + \mathbf{u}_s^\top \mathbf{v}_c \quad (12)$$

where $\mu$ is a global bias term, $p_s$ ($\mathbf{p} \in \mathbb{R}^{n}$) and $q_c$ ($\mathbf{q} \in
\mathbb{R}^{m}$) are the student and course bias terms (in this case, for
student $s$ and course $c$), respectively, and $\mathbf{u}_s$ ($U \in \mathbb{R}^{k \times n}$)
and $\mathbf{v}_c$ ($V \in \mathbb{R}^{k \times m}$) are the latent factors for student $s$ and
course $c$, respectively.


5.4.2 Matrix Factorization without Bias (MF$_0$)

We only consider the student and course latent factors to
predict the next-term grades. Therefore, the grade a student
$s$ will achieve on a course $c$ is calculated as follows:

$$\tilde{g}_{s,c} = \mathbf{u}_s^\top \mathbf{v}_c \quad (13)$$
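The difference between the two baseline predictors can be seen in a short sketch: the biased model adds a global, a student and a course offset to the latent dot product, while the unbiased model uses the dot product alone. The function names and the numeric values are assumptions for illustration.

```python
def predict_mf(mu, p_s, q_c, u_s, v_c):
    """Biased MF (Equation 12): global, student and course biases plus
    the dot product of the latent factors."""
    return mu + p_s + q_c + sum(u * v for u, v in zip(u_s, v_c))

def predict_mf0(u_s, v_c):
    """MF without bias terms (Equation 13): latent dot product only."""
    return sum(u * v for u, v in zip(u_s, v_c))

# Hypothetical parameters with k = 2 latent dimensions.
u_s, v_c = [0.5, 1.0], [1.0, 0.5]
with_bias = predict_mf(mu=2.8, p_s=0.3, q_c=-0.4, u_s=u_s, v_c=v_c)
without_bias = predict_mf0(u_s, v_c)
```

With these values the latent part contributes 1.0 in both models; in the biased model the remaining 2.7 comes from the bias terms, so the latent factors only need to explain deviations from the average grade.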


5.4.3 Non-negative Matrix Factorization (NMF) [15] 


We add non-negativity constraints on matrix $U$ and matrix $V$
in Equation 13. The non-negativity constraints allow MF
approaches to have better interpretability and accuracy for
non-negative data [10].


6. RESULTS AND DISCUSSION 


6.1 Overall Performance 

Table 2 presents the comparison of Pct$_0$, Pct$_1$ and Pct$_2$ for
non-transfer students for the three terms considered as test terms:
Spring 2016, Fall 2015 and Spring 2015. We observe that the
MFTCI model outperforms the baselines across the different
test sets. On average, MFTCI outperforms the MF, MF$_0$
and NMF methods by 34.18%, 11.59% and 4.08% in terms of
Pct$_0$, 16.64%, 7.96% and 4.03% in terms of Pct$_1$, and 2.10%,
3.00% and 1.98% in terms of Pct$_2$, respectively. We observe
similar results for transfer students as well (not included
here for brevity).


Table 2: Comparison Performance with PTA (%)

           Spring 2016                Fall 2015                  Spring 2015
Method     Pct$_0$↑  Pct$_1$↑  Pct$_2$↑     Pct$_0$↑  Pct$_1$↑  Pct$_2$↑     Pct$_0$↑  Pct$_1$↑  Pct$_2$↑
MF         13.25   27.71   98.02      12.05   26.63   58.89      13.03   26.09   54.83
MF$_0$     16.52   31.65   57.46      15.51   30.03   55.64      15.53   29.53   54.94
NMF        13.21   27.04   57.18      15.33   30.12   56.15      15.56   29.23   54.93

i) “↑” indicates the higher the better. ii) Reported values of Pct$_0$, Pct$_1$ and Pct$_2$ are percentages.
iii) Best performing methods are highlighted with bold.


Table 3 presents the performance of the baselines and the MFTCI
model for the three different terms for both non-transfer and
transfer students, using RMSE and MAE as evaluation metrics.
The MFTCI model consistently outperforms the baselines
across the different datasets in terms of MAE. In addition,
the results show that MF$_0$, NMF and MFTCI tend
to have better performance for the Spring 2016 term than for the
Fall 2015 term. A similar trend is observed between the Fall 2015 term
and the Spring 2015 term. This suggests that MFTCI is likely
to have better performance with more information in the
training set.


6.2 Analysis on Individual Majors 

We divide non-transfer students based on their majors and
test the baselines and the MFTCI model on each major
separately. Table 4 shows the comparison of Pct$_0$, Pct$_1$ and
Pct$_2$ on different majors. The results show that MFTCI has
the best performance for almost all the majors. Among all
the results, MFTCI has the highest accuracy when predicting
grades for PSYC and BIOL students, for which we have
more student-course pairs in the training set.


6.3 Effects from Previous Terms on MFTCI
In order to see the influence of the number of previous terms
considered in MFTCI, we run our model with only the $\Delta(T - 1)$
term in Equation 1. This method is represented as MFTCI$_{p1}$.
Figure 2 shows the comparison results of MAE for the six subsets
of data which are reported in Table 3, where “NTR”
stands for non-transfer students and “TR” stands for transfer
students. The results show that MFTCI consistently
outperforms MFTCI$_{p1}$ on all datasets. This suggests that
considering two previous terms is necessary for achieving
good prediction results. Moreover, since we model
the student’s knowledge using an exponentially
decaying function over time, we do not include the influence
from the third previous term in our model, as its influence
on the grade prediction is negligible in comparison to the
previous two terms.


6.4 Visualization of Course Influence 

To interpret what is captured in the course influence matrix
$A$ (see Eq. 1), we extract the top 20 values with the corresponding
course names (and topics) for analysis. Figures 3
and 4 show the captured pairwise course influences for the CS
and AIT majors, respectively. Each node corresponds to
one course, which is represented by the shortened course
name. We can notice from the figures that most influences
reflect content dependency between courses. For example,
in the CS major, the “Object Oriented Programming” course
has significant influence on performance in the “Low-Level Programming”
course (the former is also the latter’s
prerequisite course); “Linear Algebra” and “Discrete Mathematics”
have influence on each other; and the “Formal Methods &
Models” course has influence on the “Analysis of Algorithms”
course. In the case of the AIT major, both the “Introductory IT”
course and the “Introductory Computing” course have influence
on the “IT Problem & Programming” course; the “Multimedia &
Web Design” course has influence on both the “Applied IT Programming”
course and the “IT in the Global Economy” course.
GMU has a sample schedule of eight-term courses for each
major in order to guide undergraduate students to finish
their study step by step based on the level, content and
difficulty of courses*. Among the identified relationships
shown in Figures 3 and 4, we found 17 and 13 of the CS and
AIT course influences in the guide map, respectively. The
rest of the identified influences involve general electives
that are nonetheless required courses (e.g., the “Public Speaking” course),
or specific electives pertaining to the major (e.g., the “Research
Methods” course). This shows that our model learns meaningful
course-wise influences and successfully uses them to improve
the MF model.


Figure 2: Comparison performance for MFTCI$_{p1}$ and MFTCI
(bar chart of MAE for NTR Spring 2016, NTR Fall 2015, NTR Spring 2015, TR Spring 2016, TR Fall 2015 and TR Spring 2015)


Figure 5 shows the identified course influences for the BIOL, 
CEIE, CPE and PSYC majors. These identified course-wise 
influences seem to capture similarity of course content. 


7. CONCLUSION AND FUTURE WORK 


We presented a Matrix Factorization with Temporal Course- 
wise Influence (MFTCI) model that integrates factorization 
models and the influence of courses taken in the preceding 
terms to predict student grades for the next term. 


We evaluate our model on the student educational records
from Fall 2009 to Spring 2016 collected from George Mason
University. The dataset in this study contains both
non-transfer and transfer students from six different majors.
Our experimental evaluation shows that MFTCI consistently
outperforms the different state-of-the-art methods in terms of MAE.
Moreover, we analyze the effects from previous terms on
MFTCI, and we conclude that it is necessary
to consider two previous terms. In addition, we visualize
the patterns learned between pairs of courses. The results
demonstrate that the learned course influences correlate
with the course content within academic programs.


In the future, we will explore incorporation of additional constraints
over the pairwise course influence matrix, such
as prerequisite information and the compulsory or elective status
of a course. We will also explore using the course influence
information to build a degree planner for future students.


*http://catalog.gmu.edu


Table 3: Comparison Performance with RMSE and MAE.

           Non-Transfer Students                            Transfer Students
           Spring 2016    Fall 2015      Spring 2015      Spring 2016    Fall 2015      Spring 2015
Methods    RMSE   MAE     RMSE   MAE     RMSE   MAE       RMSE   MAE     RMSE   MAE     RMSE   MAE
MF         0.999  0.754   1.037  0.786   1.023  0.784     0.925  0.688   0.921  0.686   0.985  0.732
MF$_0$     0.929  0.714   0.977  0.752   1.014  0.778     0.893  0.668   0.944  0.705   1.011  0.765
NMF        1.020  0.769   0.967  0.746   1.000  0.771     0.906  0.683   0.932  0.701   0.979  0.746
MFTCI      0.928  0.685   0.982  0.717   1.012  0.750     0.887  0.636   0.927  0.662   1.000  0.721


Figure 3: Identified course influences for CS major
(nodes: Object Oriented Programming, Low-Level Programming, Data Structures, Digital Electronics, Formal Methods & Models, Analysis of Algorithms, Discrete Mathematics, Research Methods, Linear Algebra, Reading & Writing, Computer Ethics, Analytic Geometry & Calculus, Public Speaking, Advanced Composition)


Table 4: Comparison Performance for Different Majors

Metric    Method    AIT     BIOL    CEIE    CPE     CS      PSYC
Pct$_0$   MF        --      18.00   15.99   12.99   15.98   20.18
Pct$_0$   MF$_0$    --      22.10   16.70   14.21   16.47   22.12
Pct$_0$   NMF       --      22.16   17.01   14.32   16.61   22.17
Pct$_0$   MFTCI     --      24.24   16.80   14.32   17.32   25.83
Pct$_1$   MF        --      35.43   31.47   27.86   31.53   39.41
Pct$_1$   MF$_0$    --      39.68   31.87   27.97   30.51   39.63
Pct$_1$   NMF       --      39.74   31.67   27.19   30.43   39.36
Pct$_1$   MFTCI     --      40.87   32.38   27.53   31.78   42.29
Pct$_2$   MF        --      67.78   58.66   52.28   56.91   71.01
Pct$_2$   MF$_0$    --      67.54   58.35   50.72   56.24   67.74
Pct$_2$   NMF       --      67.54   58.55   51.17   56.17   67.79
Pct$_2$   MFTCI     66.70   68.25   58.76   52.94   58.18   68.29


APPENDIX
A. COMPUTATIONAL COMPLEXITY ANALYSIS


The computational complexity of MFTCI is determined by
the four steps in the alternating approach as described above.
To update $U$ and $V$ as in Equation 2 using the gradient descent
method via alternating minimization, the computational
complexity is $O(\mathit{niter}_{uv}(k \times n_{s,c} + k \times m + k \times n)) =
O(\mathit{niter}_{uv}(k \times n_{s,c}))$ (typically $n_{s,c} > \max(m,n)$), where $n_{s,c}$
is the total number of student-course dyads, $n$ is the number
of students, $m$ is the number of courses, $k$ is the latent
dimension of $U$ and $V$, and $\mathit{niter}_{uv}$ is the number of iterations.
To update $A$ as in Equation 3 using the gradient descent
method, the computational complexity is upper-bounded by
$O(\mathit{niter}_{A}(n_{c,c} \times \frac{n_{s,c}}{m}))$, where $n_{c,c}$ is the number of course pairs
that have been taken by at least one student, $\frac{n_{s,c}}{m}$ is the average
number of students for a course, which upper bounds
the average number of students who co-take two courses,
and $\mathit{niter}_{A}$ is the number of iterations. Essentially, to update
$A$, we only need to update $A(c_i,c_j)$ where $c_i$ and $c_j$
have been co-taken by some students. For $A(c_i,c_j)$ where
$c_i$ and $c_j$ have never been taken together, the entries will remain
0. To update $Z_1$ as in Equation 4, a singular value decomposition
is involved and thus its computational complexity
is upper bounded by $O(m^3)$. To update $Z_2$ as in Equation 8,
the computational complexity is $O(m^2)$. To update
$U_1$ and $U_2$ as in Equation 11, the computational complexity
is $O(m^2)$. Thus, the computational complexity for MFTCI
is $O(\mathit{niter}(\mathit{niter}_{uv}(k \times n_{s,c}) + \mathit{niter}_{A}(n_{c,c} \times \frac{n_{s,c}}{m}) + m^3 + m^2))$
$= O(\mathit{niter}(\mathit{niter}_{uv}(k \times n_{s,c}) + \mathit{niter}_{A}(n_{c,c} \times \frac{n_{s,c}}{m}) + m^3))$, where
$\mathit{niter}$ is the number of iterations for the four steps. Although
the complexity is dominated by $m^3$ due to the SVD
on $A + U_1$, since $m$ (i.e., the number of courses) is typically
not large, the run time will be more dominated by $n_{s,c}$ (i.e.,
the number of student-course dyads).


Figure 4: Identified course influences for AIT major

Figure 5: Identified course influences for different majors


Nanyang Technological University
khartman@ntu.edu.sg

Sivanagaraja Tatinati
Nanyang Technological University
50 Nanyang Ave
Singapore 639798
tatinati@ntu.edu.sg

Andy W. H. Khong
Nanyang Technological University
50 Nanyang Ave
Singapore 639798
andykhong@ntu.edu.sg


ABSTRACT


Expertise in a domain of knowledge is characterized by a greater
fluency for solving problems within that domain and a greater
facility for transferring the structure of that knowledge to other
domains. Deliberate practice and the feedback that takes place
during practice activities serve as gateways for developing domain
expertise. However, there is a difficulty in consistently aligning
feedback about a learner’s practice performance with the intended
learning outcomes of those activities, especially in situations
where the person providing feedback is unfamiliar with the
intention of those activities. To address this problem, we propose
an intelligent model to automatically label opportunities for
practice (assessment questions) according to the learning outcomes
intended by the course designers. As a proof of concept, we used a
reduced version of Bloom’s Taxonomy to define the intended
learning outcomes. Using a factorial design, we employed term
frequency-inverse document frequency (TF-IDF) and latent
Dirichlet allocation (LDA) to transform questions from text to word
weights, with support vector machine (SVM) and extreme
learning machine (ELM) to train and automatically label the
questions. We trained our models with 120 questions labeled by the
subject matter expert of an undergraduate engineering course.
Compared to existing works which create models based on a
self-generated dataset, our proposed approach uses 30 questions
from online/textbook sources that were not used in training to
validate the performance of our models. Exhaustive comparison
analysis of the testing set showed that TF-IDF with ELM
outperformed the other combinations by yielding 0.86 reliability
(F1 measure) with the subject matter expert.


Keywords


Learning outcomes, Term frequency-inverse document frequency,
Latent Dirichlet allocation, Extreme learning machine, Support
vector machine


1. INTRODUCTION


Increasingly, modern curriculum design in tertiary and adult
learning settings has become a collaborative endeavor between
subject matter experts, learning designers, and learning
technologists. While these teams employ a variety of process
models for the planning, execution, and revision of their curriculum
and activity designs, often greater attention is paid to the
construction of a course design and the course content rather than
the assessment practices that measure learning and their ongoing
maintenance.


The algorithms and use case described in this paper exist in a 
particular context of outcome-based education. In this context, 
learning is defined by observable changes in a learner’s behavior. 
These changes are commensurate with Krathwohl’s model of learning
objectives [1] but learning outcomes go beyond objectives. 
Learning outcomes are predicated on having learners observably 
demonstrate their growing understanding of a topic or proficiency 
within a field [2]. When learning activities become more open- 
ended and exploratory, and when learners are offered choices for 
how to proceed, learners often look to how they will ultimately be 
assessed to gauge which learning strategies they should employ [3]. 


When a course’s learning activities support its assessment practices 
and the assessment practices support the types of outcomes that are 
relevant to learners in the future, the course’s activities and 
intended learning outcomes exhibit constructive alignment with 
each other [2]. Adhering to constructive alignment creates a 
seamless path from learning, to applying, to transferring concepts 
and relationships when solving novel problems. 


However, the promise of constructive alignment is not easily 
delivered upon. Oftentimes, a course’s learning outcomes cannot 
be measured by its assessment practices, or its assessment practices 
are decontextualized from the types of activities and practices 
learners are actually preparing for [4]. Whether in the context of 
higher learning or professional development, when thinking about 
developing flexible, life-long learners it is paramount to have 
mechanisms in place to support learners as they work to gain 
domain expertise. These processes should reliably measure 
learning and link assessment practices to authentic activities. 


1.1 Learning design for domain expertise 

Prior work in designing for adaptive domain expertise, the kind of 
expertise necessary for learners to function in changing 
environments and flexible job scopes, has shown that learning 
design teams need to be cognizant of three elements which will be 
discussed in turn. 


1.1.1 Levels of learning outcomes 

Learning outcomes range in sophistication and vary by field. In 
medicine, Miller’s Pyramid [5] lists learning outcomes beginning 
with knowing about a subject, progressing to knowing how to do 
something, to being able to actually demonstrate it in a contrived 
setting like a role-play with actors, and to being able to demonstrate 
it in a real environment like a surgical theater [6]. The idea is based
on the belief that the development of expertise is a progression from
the recall of facts to the execution of skills. However, as research
on problem-based learning has shown, demonstration of skill and
the recall of facts can proceed independently of each other
depending on the learning environment [7].


In [8], a field agnostic method of classifying learning outcomes 
based on their quality is presented. Essentially, the Structure of 
Observed Learning Outcomes (SOLO) taxonomy identifies the 
level of cognitive sophistication a learning outcome requires. 
Lower level learning outcomes indicate a learner is capable of 
remembering facts in isolation. More sophisticated levels require 
learners to assimilate information from various sources to make 
connections and transform that understanding into something new. 


Perhaps the most popular listing of learning outcomes is Bloom’s 
Taxonomy. Similar to Miller’s Pyramid, Bloom’s Revised 
Taxonomy also begins with the retrieval of facts and information 
as its foundation and builds up to application of knowledge and 
further to analyzing, evaluating, and creating. Because of its 
simplicity and familiarity with learning designers and subject 
matter experts alike, Bloom’s Taxonomy can easily be used to 
identify the levels of learning outcomes in a course [9]. 


1.1.2 Opportunities for deliberate practice 

Along with identifying a learning activity’s intended outcomes, 
expertise development requires opportunities for deliberate 
practice. In contrast to repetitive practice intended for learners to 
develop automaticity in either the recall of information or the 
application of a skill, often during time-limited tasks, deliberate 
practice focuses on mastering the nuances of the domain itself to 
fine-tune performance [10]. In fact, a learner’s level of grit, a
combination of perseverance and passion, predicts how closely
a learner will eventually approach expert performance [11].


The key difference in processes between repetitive practice and 
deliberate practice leads to different forms of expertise: adaptive 
and routine [12]. Routine forms of expertise allow a learner to 
conduct a task at an optimal level. Adaptive expertise allows 
learners to learn new tasks or solve novel problems at an 
accelerated rate. In an industrial setting, routine expertise helps a 
worker complete a particular job function. Adaptive expertise 
enables that same worker to retrain to fill new job functions. 
Typically, the amount of time necessary to achieve expert 
performance in a domain is in the order of years to decades [13]. 
However, incremental improvement can be seen in a few practice 
cycles when activities align to the intended learning outcomes. 


1.1.3 Formative assessments and actionable feedback


Hand in hand with creating opportunities for deliberate practice is 
providing formative feedback to the learner about how to improve 
that practice while that improvement is still relevant. Imagine 
students who diligently answer every question in an engineering 
textbook but never receive feedback on the quality of their 
solutions. In this case, the learners would be unable to gauge their 
performance in relation to the course learning outcomes or have an 
idea about how to improve their performance in the future. Now 
imagine if those same students do receive feedback, but that 
feedback arrives after the course’s final examination. If the content 
of the course is mostly self-contained and will not be revisited, the 
feedback is mostly irrelevant. 


Formative feedback consists of two parts: 1) an interpretable
indication of a learner’s performance on an assessment of learning
with respect to a standard of performance (learning outcome) and
2) the opportunity to improve performance before the final
evaluation [14].


Cognitive tutors provide a clear example of the power of coupling 
formative assessment and actionable feedback together in the 
domain of mathematics learning [15]. By presenting learners with 
a series of structured problems, cognitive tutors are capable of 
intervening at any point during the problem-solving process to 
provide students with feedback about their performance. This 
feedback may be the identification of an error, the presentation of 
a hint, or the request for more information about the learner’s 
reasoning. After the feedback, learners have the opportunity to 
adjust their problem-solving heuristics to improve their 
performance going forward. 


Such an interaction sequence works with highly structured tasks 
with application-oriented learning outcomes. However, the 
feedback cycle is more difficult to manage when the learning 
outcomes are aligned to higher-order reasoning like evaluating, 
analyzing and creating. These outcomes have multiple paths for 
reaching a satisfactory answer. 


With this difficulty in mind, we looked at techniques to automate 
the process of identifying the reasoning level of text-based 
assessment items (questions) with the intention of better aligning 
questions to learning outcomes as a first step toward being able to 
provide opportunities for deliberate practice. Subsequently, the 
outcome of our proposed work is to link actionable feedback to a 
learner’s performance on assessment items. 


1.2 Automated question classification techniques 

Prior work has shown the viability of automatically labeling 
questions in accordance with a course’s learning outcomes. 
However, our work goes beyond labeling existing content to 
helping course instructors promote deliberate practice and expertise 
development by providing a method of finding new questions that 
align to the course designer’s original intended learning outcomes. 
We highlight the drawbacks of prior work and how our proposed 
approach addresses those limitations. 


1.2.1 Labeling questions based on difficulty level 
Early attempts at automatically labeling questions relied on subject 
matter experts to pre-define the difficulty levels of questions. 
An artificial neural network trained by backpropagation then used 
the question features and assigned difficulty levels in the training 
set to classify new questions. The network used a five-dimensional 
feature vector consisting of query-text relevance, mean term 
frequency, length of questions and answers, term frequency 
distribution (variance), and distribution of questions and answers 
in a text. The method yielded an F1 measure, a classification 
reliability metric that combines precision and recall, of 0.78 [16]. 
However, a major pitfall of this method is its lack of semantic 
analysis. 


An entropy-based decision tree has also been used to label 
questions [17]. The weakness of this strategy is the high 
possibility of overfitting the model during the training phase, 
which then negatively affects the subsequent prediction performance. 


1.2.2 Labeling questions based on Bloom’s Taxonomy using Natural Language Processing 

Natural Language Processing (NLP) has been used for the 
generation of assessments, answering questions, supporting users 
in Learning Management Systems and preparing course materials. 
The WordNet package has been used to detect semantic similarity. 
With a rule-based approach, the accuracy of labeling a 
question based on Bloom’s Taxonomy reached 82% [18]. To 
improve on the rule-based approach, a hybrid technique that 
combines an N-gram classifier with a rule-based approach has also 
been explored. The rules were based on combining parts-of-speech 
tags, and the N-gram classifier found the probabilities of 
predicting certain words. This hybrid method yielded an F1 measure 
of 0.86 [19]. 


1.2.3 Labeling questions based on Bloom’s Taxonomy using machine learning techniques 

Machine learning algorithms can be broadly split into either 
supervised or unsupervised training implementations. Generally, 
supervised training is adopted when the labels have been 
pre-determined and the training questions have been labeled by an 
expert. The most commonly used word weighting method in such cases 
is term frequency-inverse document frequency (TF-IDF). The 
algorithm assigns weightages to individual words in a question 
statement to define a custom vector for each question. 


Machine learning techniques such as k-nearest neighbors, Naive 
Bayes and support vector machine (SVM) have been implemented 
for labeling questions. In a performance comparison among these 
three techniques, an F1 measure of 0.71 was achieved using SVM 
[20]. To increase the accuracy level, additional features were 
incorporated in later versions of the work. Three different 
feature selection processes, namely Odds Ratio, Chi-square 
statistic and Mutual Information, were used with the three machine 
learning techniques. The F1 measure reached 0.9 [21]. 


Furthermore, an integrated approach of feature extraction has been 
proposed by using headword, semantic, keyword and syntactic 
extractions, which are fed into SVM [22]. However, this work has 
not yet been completed by using a testing dataset to quantify the 
reliability of prediction. 


A major downside of existing works is that both the training and 
the testing questions are part of the same course curriculum; the 
questions are generated by the same author/instructor. Even when 
a high F1 measure is achieved, it does not enable the algorithm to 
label questions written by another subject matter expert. Our work 
increases the flexibility of labeling methods by testing our models 
with a new set of questions compiled from textbooks and online 
resources. 


In addition, our work introduces extreme learning machine (ELM), 
which has been shown to outperform SVM during similar labeling 
tasks [23]. Moreover, we introduce LDA as an alternative technique 
to TF-IDF for transforming question statements into numerical 
word weightages. 


By comparing combinations of these new techniques with more 
traditional techniques, we aim to gauge which combination attains 
the highest labeling reliability with the subject matter expert when 
automatically labeling untrained questions. For our purposes, using 
the combination with the highest F1 measure (fewest false 
negatives and false positives) becomes paramount. In our use case, 
a mislabeling by the algorithm will lead to the wrong set of practice 
questions to be given to students and diminish the impact of 
deliberate practice on reaching the intended learning outcomes. 


2. METHODS 
2.1 Materials 
2.1.1 Labeling scheme 


The core of this study centers on a labeling scheme for identifying 
the sophistication of learning outcomes based on a simplified 
version of Bloom’s Taxonomy. In this labeling scheme, the first 
two levels of Bloom’s Taxonomy (Remembering and Understanding) 
were collapsed into Remember. Applying remained its own category. 
All of the higher-order reasoning categories (Analyzing, 
Evaluating, and Creating) were collapsed into Transfer. Figure 1 
shows how our labeling scheme categories map onto the original 
categories from Bloom’s Revised Taxonomy. 


Figure 1: Mapping of Bloom's Revised Taxonomy [24] 


We collapsed the taxonomy into three categories for two reasons. 
First, the subject matter expert tasked with labeling the questions 
was unsure about how reliably the questions could be labeled by 
someone without a background in learning design, educational 
psychology, or curriculum development. Collapsing the categories 
to Remember, Apply, and Transfer made manually labeling 
hundreds of questions to train the machine learning algorithms 
more tractable. Second, collapsing the categories had the effect of 
making Bloom’s Taxonomy more analogous to the successful use 
cases of Miller’s Pyramid by subject matter experts in both higher 
education and professional development settings [5]. 


2.1.2 Question dataset 

The dataset consists of a total of 150 questions used for training and 
testing the machine learning algorithms based on the content of an 
undergraduate electrical and electronic engineering course. 


For this study, we formed a training set of 120 questions by 
randomly selecting 40 Remember, Apply, and Transfer items from 
the larger question pool of more than 200 questions used in that 
course. The pool came from a repository of four years’ worth of 
assignment, homework, quiz and exam questions presented to 
students. These questions prompt students for a range of answer 
types (i.e., open-ended, multiple-choice, short-structured, essay). 


We then created a testing set of 30 new questions compiled from 
external sources such as textbooks and online question banks. This 
set was also balanced with equal representation of Remember, 
Apply, and Transfer questions. 


2.2 Data pre-processing procedures 

We pre-processed the raw questions in two phases. First, the subject 
matter expert labeled every question according to the labeling 
scheme described above. Second, we transformed the text of every 
question into a machine-readable format before passing them 
through the machine learning algorithms. 


2.2.1 Subject matter expert pre-processing 

The subject matter expert manually labeled each question in the 
training set based on its intended learning outcome (Remember, 
Apply or Transfer). The subject matter expert then labeled the 30 
new questions in the testing set in the same manner. These new 
questions are labeled for the purpose of knowing the ground truth 
for performance evaluation. Table 1 below shows some examples 
of the labeled questions. 
Table 1 - Examples of labeled questions 

Label | Question 
Remember | Consider a signal described by y[n] = 2n + 4. What would be the amplitude of the signal at sample index n = 3? 
Apply | Consider the following input and output signals: find the transfer function and state the poles and zeros of this transfer function. 
Transfer | Describe how the bandpass filter can be utilized for radar applications. 


2.2.2 Text pre-processing 

The text transformation began by excising all equations, 
mathematical symbols and diagrams from the questions. We only 
kept the core of the question prompts by removing the descriptive 
and explanatory text from scenario and hypothetical questions. For 
example, if a question began by setting the stage with “Peter has 
been asked to perform...”, followed by the question prompt “How 
much voltage should Peter expect in the circuit?”, all of the 
descriptive text prior to the question prompt was removed to 
improve the consistency of word length and usage between items. 


For the remaining words in the questions, we changed all of the 
characters to lower case and removed all punctuation marks, 
numbers, and non-unicode characters. We then stemmed the remaining 
words to obtain a list of root words. From this list of root words, we 
removed all words with fewer than three letters. Because we were 
unsure of the relationship between the words and the labels, we did 
not create a list of stopwords for removal. 
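The text pre-processing steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the study itself was implemented in R, the function names `naive_stem` and `preprocess` are hypothetical, the suffix stripper is a crude stand-in for whatever stemmer the authors used, and keeping only the letters a-z is a simplification of the character filtering described in the text.

```python
import re

def naive_stem(word):
    """Crude suffix stripper (assumption: stands in for a real stemmer such as Porter's)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(prompt):
    """Lower-case, strip punctuation and numbers, stem, and drop short words."""
    text = prompt.lower()
    # Keep only the letters a-z and whitespace (simplifying assumption).
    text = re.sub(r"[^a-z\s]", " ", text)
    stems = [naive_stem(word) for word in text.split()]
    # Remove stems with fewer than three letters; no stopword list is applied.
    return [stem for stem in stems if len(stem) >= 3]

tokens = preprocess("How much voltage should Peter expect in the circuit?")
# tokens == ['how', 'much', 'voltage', 'should', 'peter', 'expect', 'the', 'circuit']
```

Note that common words such as "the" survive because, as in the study, no stopword list is used; only the two-letter word "in" is dropped by the length filter.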


3. TECHNIQUES 


We tested four combinations (in no particular order) of word 
weighting and question labeling algorithms, as shown in Figure 2, 
to identify the techniques with the highest reliability for our 
automated learning outcome labeler. 


Figure 2: Four combinations of algorithms 


Every word in each question prompt was assigned a weightage 
value based on either term frequency-inverse document frequency 
(TF-IDF) or latent Dirichlet allocation (LDA). Subsequently, the 
vector values for each question were passed through either support 
vector machine (SVM) or extreme learning machine (ELM) to 
assign a label. All algorithms were implemented in RStudio. 


3.1 Term frequency-inverse document frequency 

Term frequency-inverse document frequency (TF-IDF) is a 
technique for finding the relative frequency of words in a given 
document and comparing those frequencies with the inverse of 
how often each of those words appears in the complete document 
corpus. The resulting product can be used to signify the relevance 
of each unique word within a single document. 


We implemented a modified version of TF-IDF that used individual 
questions as the source of the analysis instead of complete 
documents. This focused the model on finding the relevance of each 
word within each single question. By converting each question into 
a vector of weightages based on word frequencies, the machine 
learning algorithms were then used to label the questions. The 
modified TF-IDF model can be described by 

\[ \text{TF-IDF}(w_i, q_k) = \#(w_i, q_k) \times \log \frac{TR}{\#TR(w_i)} \tag{1} \]

where $w_i$ refers to a particular word $i$, $q_k$ refers to a 
particular question $k$, $\#(w_i, q_k)$ refers to the number of 
times $w_i$ occurs in $q_k$, $TR$ refers to the total number of 
questions and $\#TR(w_i)$ refers to the question frequency, or the 
number of questions in which $w_i$ occurs [20]. 


In the case where the term frequency (TF) count is biased towards 
longer questions, the TF count is normalized as 

\[ TF_{i,k} = \frac{n_{i,k}}{\sum_{j} n_{j,k}} \tag{2} \]

where $n_{i,k}$ refers to the number of times $w_i$ occurs in 
$q_k$, and the denominator (the size of each question) is the sum 
of the number of times each word appears in $q_k$ [25]. 


For our work, the pre-processing procedures registered a total of 
465 unique stemmed words in our compilation of 120 training 
questions and 30 testing questions. This led to each question being 
represented as a vector of 1 row and 465 columns arranged in 
alphabetical order by stemmed word. When a word is present in a 
question, the normalized weight of that word is assigned to that 
question’s vector element. If a word is not present in the question, 
the weight is zero. 


After determining the unique word weightage vectors for all 150 
questions, the entire matrix is sorted such that for each question, the 
weightages are arranged in ascending order. The top ten weightages 
are chosen for each question. The 10 weightages may correspond 
to different words in each question, but their combinations remain 
question-specific and give a numerical representation of each 
question statement. This new vector of 10 columns per question 
serves as the input to the machine learning algorithms. 
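The weighting and selection steps described in this section can be sketched in Python as follows. This is a minimal illustration and not the authors' R implementation: the function name `tfidf_vectors` is hypothetical, the natural logarithm is assumed because the source does not state the base, and the three toy questions are invented for the example. Each question is weighted with the normalized term frequency of Equation (2) multiplied by the inverse question frequency of Equation (1), and the largest `top_n` weightages are kept in ascending order.

```python
import math
from collections import Counter

def tfidf_vectors(questions, top_n=10):
    """questions: list of token lists, one per pre-processed question."""
    total = len(questions)                                    # TR in Eq. (1)
    vocab = sorted({w for q in questions for w in q})         # alphabetical columns
    question_freq = Counter(w for q in questions for w in set(q))  # #TR(w_i)
    features = []
    for q in questions:
        counts = Counter(q)
        size = sum(counts.values()) or 1                      # denominator of Eq. (2)
        row = [
            (counts[w] / size) * math.log(total / question_freq[w]) if w in counts else 0.0
            for w in vocab
        ]
        # Sort ascending and keep the largest top_n weightages.
        features.append(sorted(row)[-top_n:])
    return vocab, features

toy_questions = [
    ["find", "transfer", "function"],
    ["find", "pole", "zero", "pole"],
    ["describe", "filter"],
]
vocab, features = tfidf_vectors(toy_questions, top_n=3)
```

With a full vocabulary of 465 stemmed words and `top_n=10`, short questions would carry several zero entries in their ten-element vector, because absent words receive a weight of zero.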


As an example, we will use the pre-processed question prompt: 
for signal which begin when the one side unilateral ztransform given 


Table 2 below shows the weightages assigned to the above example 
after the application of the TF-IDF technique. The weightages are 
then arranged in ascending order and the top 10 values are taken. 


Table 2 - TF-IDF weightage arrangement 


when 0.279 
3.2 Latent Dirichlet allocation 

Latent Dirichlet allocation (LDA) is a probabilistic technique for 
topic modeling based on a Bayesian model. The essential idea of 
LDA is that each document consists of a mixture of topics, with the 
continuous-valued mixture proportions drawn from a Dirichlet 
distribution, a continuous multivariate probability distribution. 


Again, in the context of our work, we applied LDA to questions in 
the dataset by substituting the original notion of documents in the 
LDA algorithm with questions in our modified model. Therefore, 
the modified model attempted to find k number of topics (k is a 
user-defined parameter to determine the desired number of topics, 
or dimensionality of the Dirichlet distribution) for a given set of 
question statements based on the choice and usage of words in each 
question. The joint distribution of a topic mixture, a set of topics 
and a set of words can be represented by 


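\[ p(\theta, t, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{i=1}^{N} p(t_i \mid \theta)\, p(w_i \mid t_i, \beta) \tag{3} \]

where parameter $\alpha$ is a k-vector with components greater than 
zero, parameter $\beta$ refers to the matrix of word probabilities, 
$\theta$ refers to a k-dimensional Dirichlet random variable, $t_i$ 
refers to a topic, $w_i$ refers to a word, and $N$ refers to the 
number of words in a question [26]. 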


Figure 3 shows a graphical model representation of LDA. The 
bigger circle refers to questions while the smaller circle refers to 
the repeated choice of topics and words within each question. 


Figure 3: Graphical model representation of LDA 


Since LDA involves topic modeling, an appropriate k value chosen 
for our work was ten. This allowed a standard comparison between 
LDA and the top ten weightages from the TF-IDF method. The 
generated unique topics (based on the stemmed words) are shown 
in Table 3. 


Table 3 - Topic names generated by LDA 

Out of the entire set of stemmed words detected, ten words have 
been identified as topic names. LDA then automatically associates 
the remaining words with the above-mentioned ten topics. 
Based on the words that appear in each question, LDA displays the 
number of topics per question. Based on the topic assignments, the 
topic weightages for each question are generated. For topics not 
present in a question, a minimal weightage is given to those topics 
in lieu of a zero value. This value ensures that the topic 
weightages for a question sum to one. Similar to the TF-IDF output, 
the new vector of 10 columns per question becomes the input for the 
machine learning algorithms. 


3.3 Extreme learning machine 

Extreme learning machine (ELM) is a learning algorithm for 
single-hidden layer feedforward neural networks (SLFNs). ELM 
can be used for classification, regression, clustering, compression 
and feature learning. ELM randomly chooses the hidden node 
parameters and then determines the output weights of the network. 


The following three-step learning model explains ELM. Given a 
training set that is labeled (information about the target nodes), 
hidden node activation function and number of hidden nodes, 


Step 1: Randomly assign hidden node parameters 
Step 2: Calculate the hidden layer output matrix, H 
Step 3: Calculate the output weight $\gamma$ 


Given a set of inputs with unknown labels, the objective is to find 
the target outputs [27]. Once the inter-layer weights have been 
found, the same weights are used during the testing phase. For a 
given set of input samples $x_k$, the target output is given by 
$t_k$. For $L$ hidden nodes and an activation function $f(x)$, the 
SLFN is modeled as 

\[ \sum_{j=1}^{L} \gamma_j f_j(x_k) = \sum_{j=1}^{L} \gamma_j f(w_j \cdot x_k + b_j) = o_k, \quad k = 1, \dots, N \tag{4} \]

where $w_j$ refers to the weight vector that stores the weights 
between the input and hidden nodes, $\gamma_j$ refers to the weight 
vector that stores the weights between the hidden and output nodes, 
$b_j$ refers to the threshold of the jth hidden node, and $N$ refers 
to the number of input samples. The objective is that $o_k$ and 
$t_k$ (the original target) should have zero difference [23]. 
Possible activation functions include sigmoid, sine, radial basis 
and hard-limit. 


In our case, the outputs of the ELM are three continuous values 
that represent the values assigned to the three learning outcome 
categories (Remember, Apply and Transfer). To convert the three 
values into binary values for comparing the predicted labels with 
the actual labels, we set the learning outcome category with the 
highest value to one and the remaining two to zero. 
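The three training steps and the conversion of the continuous outputs into labels can be sketched with NumPy as follows. This is a sketch under stated assumptions and not the authors' R code: the function names `train_elm` and `predict_elm` are hypothetical, a sigmoid activation is used for simplicity, the output weights are solved with the Moore-Penrose pseudo-inverse (a common choice in the ELM literature that the text above does not spell out), and the toy data are random.

```python
import numpy as np

def train_elm(X, T, n_hidden, rng):
    """X: (N, d) inputs; T: (N, c) one-hot targets. Returns (W, b, gamma)."""
    W = rng.normal(size=(X.shape[1], n_hidden))   # Step 1: random input-to-hidden weights
    b = rng.normal(size=n_hidden)                 #         and random hidden thresholds
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))        # Step 2: hidden layer output matrix H
    gamma = np.linalg.pinv(H) @ T                 # Step 3: least-squares output weights
    return W, b, gamma

def predict_elm(X, W, b, gamma):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    O = H @ gamma                                 # three continuous outputs per question
    labels = np.zeros_like(O)
    # Set the category with the highest output to one and the others to zero.
    labels[np.arange(len(O)), O.argmax(axis=1)] = 1.0
    return labels

rng = np.random.default_rng(0)
X_train = rng.random((6, 10))                     # six toy questions, ten weightages each
T_train = np.eye(3)[[0, 1, 2, 0, 1, 2]]           # one-hot Remember / Apply / Transfer
W, b, gamma = train_elm(X_train, T_train, n_hidden=20, rng=rng)
predicted = predict_elm(X_train, W, b, gamma)
```

Because the hidden node parameters are random, repeated runs with different seeds give different models, which is why results for ELM are typically averaged over several repetitions.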


3.4 Support vector machine 

Support vector machine (SVM) is a mapping of data samples such 
that these samples can be distinctly labeled. The concept of SVM 
is derived from margins and subsequently separating data into 
groups with large gaps between them. Deriving an optimal 
hyperplane for identifying linearly separable patterns is the key to 
SVM. This idea is extended to cases where the patterns are non- 
linearly separable, by using a kernel function to transform the 
original data samples to map onto a new space [28]. Possible 
kernels are: linear, polynomial, radial basis and sigmoid. 


For our work, we used the C-support vector classification type. 
Given a set of inputs and targets, the cost function is given by [29] 


\[ \min \; \frac{1}{2} p^{T} p + C \sum_{j=1}^{k} \xi_j \tag{5} \]

subject to $y_j (p^{T} \phi(v_j) + m) \geq 1 - \xi_j$, $\xi_j \geq 0$, $j = 1, \dots, k$ 

where $C > 0$ is the regularization parameter, $m$ is a constant, 
$p$ is the vector of coefficients, $\xi_j$ refers to parameters 
that handle the inputs (slack variables for samples that cannot be 
separated), index $j$ labels the $k$ training cases, $v$ refers to 
the independent variables, $y$ refers to the class labels, and 
$\phi$ refers to the kernel that transforms data from the input 
space to the chosen feature space. 


Fundamentally, support vectors are data points that lie close to the 
decision boundary, which are the hardest to classify. SVM 
maximizes the margin around the hyperplane that separates these 
points. The cost function is determined based on the training 
samples (support vectors). These support vectors are the basic 
elements of a training set that would change the position of the 
hyperplane dividing the dataset. SVM becomes an optimization 
problem for determining the optimal hyperplane. 


3.5 Performance metrics 

To evaluate the reliability of our four technique combinations with 
the subject matter expert’s labels, we used the F1 measure. 
Accuracy is the number of correct labels divided by the size of the 
testing data. The F1 measure is the harmonic mean of two other 
metrics: precision and recall. Precision refers to the proportion 
of questions assigned to a particular category that truly belong to 
that category. Recall refers to the proportion of questions truly 
belonging to a category that the algorithm assigned to that 
category. 


Because minimizing the number of false positives and false 
negatives was important for accurately assigning new questions to 
the correct practice sets, we used the F1 measure as the basis for 
our algorithm comparisons. To explain the F1 measure, we will step 
through the confusion matrix used to describe the performance of a 
labeling model on a set of testing data. There are four concepts used 
to construct the confusion matrix: 


True positive (TP) refers to the number of questions that the 
algorithm correctly identifies as presenting a label. 


False positive (FP) refers to the number of questions that the 
algorithm identifies as presenting a label while the subject matter 
expert indicates the label was absent. 


True negative (TN) refers to the number of questions that the 
algorithm correctly identifies as having a label absent. 


False negative (FN) refers to the number of questions that the 
algorithm identifies as having a label absent while the subject 
matter expert indicates the label was present. 


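The F1 measure is calculated as follows [30] 

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{6} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{7} \]

\[ \text{F1 measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{8} \]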
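Equations (6) to (8) can be applied per label and then averaged, as in the following Python sketch. The function name `f1_per_label` and the six toy labels are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the source does not specify.

```python
def f1_per_label(actual, predicted, labels):
    """Return {label: F1} by treating each label in turn as the positive class."""
    scores = {}
    for label in labels:
        pairs = list(zip(actual, predicted))
        tp = sum(a == label and p == label for a, p in pairs)   # true positives
        fp = sum(a != label and p == label for a, p in pairs)   # false positives
        fn = sum(a == label and p != label for a, p in pairs)   # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0          # Eq. (6)
        recall = tp / (tp + fn) if tp + fn else 0.0             # Eq. (7)
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0  # Eq. (8)
    return scores

actual    = ["R", "R", "A", "A", "T", "T"]    # subject matter expert labels (toy data)
predicted = ["R", "A", "A", "A", "T", "R"]    # algorithm labels (toy data)
scores = f1_per_label(actual, predicted, ["R", "A", "T"])
mean_f1 = sum(scores.values()) / len(scores)
```

In this toy case the per-label F1 values are 0.5 for R, 0.8 for A, and about 0.667 for T, so the mean is about 0.656.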


4. RESULTS AND ANALYSIS 
4.1 Insights by subject matter expert 


When looking at every question presented to students over the 
course of a semester, the subject matter expert identified the 
number of questions corresponding to Remember, Apply and 
Transfer as shown in Table 4. Just by labeling the course questions, 
the subject matter expert realized how misaligned the course’s 
learning outcomes were with its assessment practices. A large 
emphasis on Apply questions was expected, but the dearth of 
Transfer questions was surprising. Of those 23 Transfer items, most 
were presented during the final exam. 


Table 4 - Frequency of questions aligned to learning outcomes 

Learning outcome | Frequency (number of questions) 
Apply | 131 


One of the stated learning outcomes of the course was to prepare 
students to flexibly transfer course content to novel problems and 
new situations. However, waiting until the final exam to present 
students with such opportunities denied them actionable feedback 
during the semester. In response to the pre-processing labeling 
efforts, the subject matter expert then added 42 new transfer 
questions throughout the course for the next semester. 


4.2 Model reliability with subject matter expert 

The objective of this implementation is to evaluate whether the 
trained model is able to predict the type of question (Remember, 
Apply or Transfer). Based on the trained model using questions 
from the undergraduate course, the testing questions from 
textbooks and online sources were passed through our model to 
determine the level of reliability of labeling new questions that 
were not generated by the subject matter expert. In our intended use 
case, the testing dataset would not need to be manually labeled. 
However, to determine the level of reliability of our labeling 
algorithms, the subject matter expert’s manual labels served as a 
ground truth for the F1 measure calculations. 


4.2.1 Parameter selection 

We first determined the best set of parameters based on 10-fold 
cross validation of the training dataset. Of the 120 questions, 
90% (108 questions) were used for training and 10% (12 questions) 
were used as a validation set. This process was repeated 10 times 
using 10 different partitions of the 120 questions. The best set of 
parameters was chosen based on a grid search for both ELM and SVM. 
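The 108/12 partitioning used for the 10-fold cross validation can be sketched in Python as follows. The function name `ten_fold_splits`, the fixed seed, and the plain random shuffle are assumptions; the source does not say whether the folds were stratified by label.

```python
import random

def ten_fold_splits(n_items=120, n_folds=10, seed=0):
    """Yield (training, validation) index lists; n_items must be divisible by n_folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # assumed: simple random shuffle
    fold_size = n_items // n_folds
    for f in range(n_folds):
        validation = indices[f * fold_size:(f + 1) * fold_size]
        held_out = set(validation)
        training = [i for i in indices if i not in held_out]
        yield training, validation

splits = list(ten_fold_splits())
```

Each candidate parameter setting in the grid search would be trained on the 108 training questions of a split and scored on its 12 validation questions, with the scores averaged over the ten splits.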


The parameters that were varied for ELM were: 


1. Number of hidden nodes 
2. Activation function (sigmoid / radial basis / hard-limit) 


The parameters yielding the best results corresponded to 72 hidden 
nodes using hard-limit activation function. 


The parameters that were varied for SVM were: 


1. Kernel (sigmoid / radial basis) 
2. Cost value 
3. Gamma value 


The parameters yielding the best results corresponded to a sigmoid 
kernel, cost value = 1 and gamma value = 0.26. 


4.2.2 Comparing four combinations 

With respect to the F1 measure, calculations were done separately 
for the three labels. The mean of those calculations was then used 
as the algorithm’s overall performance measure. With respect to 
ELM, the calculation was repeated 10 times because the 
initialization weights are randomly assigned in each iteration. The 
mean value of the F1 measure was taken. 


Table 5 below shows the F1 measure values (for each individual 
class and overall F1 mean) for the four combinations. “R” refers to 
Remember, “A” refers to Apply, “T” refers to Transfer and “s.d.” 
refers to standard deviation. 


Proceedings of the 10th International Conference on Educational Data Mining 61 


Table 5 - F1 measure values for four combinations 

Combination | R | A | T | Mean | s.d. 
1. TF-IDF with SVM | 0.870 | 0.737 | 0.667 | 0.758 | 0.084 
2. LDA with SVM | 0.400 | 0.593 | 0.556 | 0.516 | 0.084 
3. TF-IDF with ELM | 0.926 | 0.815 | 0.840 | 0.860 | 0.048 
4. LDA with ELM | 0.467 | 0.520 | 0.647 | 0.545 | 0.076 


TF-IDF with ELM achieved the highest mean F1 measure value 
and the lowest standard deviation, indicating that it was the most 
reliable combination. It can be seen that the Remember label yields 
the highest F1 values out of the three labels in Combination 3. In 
general, Remember-labeled questions are short, resulting in about 
four to five zero values in the TF-IDF vector of 10 columns that is 
passed as an input into the ELM. Hence, the algorithm identifies 
Remember-labeled questions very accurately due to their size. 
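As a quick arithmetic check on Table 5, the following Python sketch recomputes the Mean and s.d. columns from the per-label F1 values. The pairing of Combination 2 with SVM is inferred from the list of four combinations, and reading the s.d. column as the population standard deviation across the three per-label values is an assumption that happens to reproduce the tabulated figures.

```python
from statistics import mean, pstdev

# Per-label F1 values (R, A, T) as reported in Table 5.
table5 = {
    "1. TF-IDF with SVM": (0.870, 0.737, 0.667),
    "2. LDA with SVM":    (0.400, 0.593, 0.556),
    "3. TF-IDF with ELM": (0.926, 0.815, 0.840),
    "4. LDA with ELM":    (0.467, 0.520, 0.647),
}

# Mean and population standard deviation across the three labels, rounded to 3 places.
summary = {
    name: (round(mean(f1), 3), round(pstdev(f1), 3))
    for name, f1 in table5.items()
}
best = max(summary, key=lambda name: summary[name][0])   # highest mean F1
```

Under that assumption the recomputed pairs are (0.758, 0.084), (0.516, 0.084), (0.86, 0.048) and (0.545, 0.076), matching the table, and `best` is the TF-IDF with ELM combination.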


The high reliability of ELM is as expected because ELM has already 
been shown to outperform SVM in terms of the standard deviation of 
training and testing root-mean-square values, time taken, and 
network complexity, as well as in a real medical diagnosis 
application [23]. On the other hand, LDA has been shown to achieve 
higher performance because it groups words together by topic 
instead of looking at combinations of individual words that may 
not link together; in the context of our work, however, TF-IDF 
outperforms LDA. This is because the goal of LDA is to correctly 
assign each document (or question) to a class label in a reduced 
dimensional space [31]. Our corpus of questions involves several 
technical terms without any prior labeling of topics. Hence, LDA 
is not appropriate for our analysis. 


5. CONCLUSIONS 

Based on the comparison of our four algorithms, our most reliable 
model (TF-IDF with ELM) is able to accurately label new course 
questions for the undergraduate electrical and electronic 
engineering course with 0.86 reliability in terms of F1 measure. 
Any novice instructor who takes over this course in the future or 
teaching assistants tasked with refreshing the course assignments 
would be able to extract new questions from any external source 
and pass them to the algorithm to automatically label the questions 
as the original course coordinator would. This allows members of 
the course design team without a strong background in learning to 
make curriculum decisions regarding the alignment of the course’s 
learning outcomes. 


As discussed earlier, outcome-based learning environments 
facilitate transforming the model of instruction from instructor- 
centric and lecture-based to learner-focused, with a variety of 
activities and learning pathways. However, in learner- 
centered environments, assessment is still the key driver, and often 
the key inhibitor of learning [3]. If the assessments require shallow 
understanding, then learners calibrate their efforts to achieve this 
low bar. When assessments require deep understanding or great 
proficiency, learners are likely to put in more effortful practice. 


In line with this assessment philosophy, our TF-IDF with ELM 
model is theoretically capable of matching any learning activity to 
any set of learning outcomes as long as the course designers or 
subject matter experts provide enough examples that are explicitly 
aligned to the intended learning outcomes when training the model. 
For the convenience of the subject matter expert in our context, we 
used a reduced version of Bloom’s Taxonomy in this study. 
However, the final algorithm is capable of using the full Bloom’s 
model, a different model, or a custom set of learning outcomes as 
its labeling framework. 


Hence, with the high reliability of the prediction algorithm 
presented in our work, our process for calibrating the algorithm can 
be used in any academic or industrial setting to provide the right set 
of formative assessment opportunities to students (enhancing 
subject knowledge) or employees (professional development). 
Once the learning outcomes of activities are labeled reliably, it is 
then easier to think about how to engage learners in deliberate 
practice to reach those outcomes and develop their expertise. Once 
opportunities for deliberate practice that align to the course learning 
outcomes are implemented into a course, it becomes easier to think 
about how to align the feedback regarding those opportunities to 
support the development of domain expertise. 


This work provides a first step toward being able to regularly introduce 
learning activities that promote the development of adaptive 
expertise into a course by matching external sources of activities 
with the course’s learning outcomes. Deliberate practice requires 
repetition that varies in ways that highlight the structural elements 
of a domain. Having a way to incorporate new sources of questions 
and problems into a course that align with the course’s goals 
provides learners more opportunities for internalizing when to 
apply their domain specific skills and knowledge. Finally, our 
algorithm is potentially useful for designing courses to reach non- 
content-based learning outcomes, making policies that support 
constructive alignment, and evaluating course assessment of 
learning plans. 


6. FUTURE WORK 


Building off of our machine learning labeling work, we would like 
to explore constructing a new version of LDA that can be tailor- 
made to label questions. There are situations in which weightages 
given to words are the same, with different words representing 
those weightages. Similarly, the same words can have different 
weightages. We are keen to continue working on features based on 
word arrangement, word context and word order that affect 
weightage assignments. In addition, ELM can be enhanced by 
using kernels. 


From the learning aspect, we would like to extend our question 
label categories to all six outcomes described in Bloom’s 
Taxonomy and expand the model to label outcomes based on the 
types of sentences used in forum conversations and other 
collaborative learning activities. Eventually, we aim to determine 
the proficiency level of learners so we can put learning supports in 
place to guide their learning journeys. Ultimately, we wish to 
provide learners with learning activities and opportunities for 
deliberate practice embedded with actionable feedback to develop 
their adaptive expertise. 
ABSTRACT


We propose a new model for learning that relates video- 
watching behavior and engagement to quiz performance. In 
our model, a learner’s knowledge gain from watching a lecture 
video is treated as proportional to their latent engagement 
level, and the learner’s engagement is in turn dictated by a set 
of behavioral features we propose that quantify the learner’s 
interaction with the lecture video. A learner’s latent concept 
knowledge is assumed to dictate their observed performance 
on in-video quiz questions. One of the advantages of our 
method for determining engagement is that it can be done 
entirely within standard online learning platforms, serving 
as a more universal and less invasive alternative to existing 
measures of engagement that require the use of external 
devices. We evaluate our method on a real-world massive 
open online course (MOOC) dataset, from which we find that 
it achieves high quality in terms of predicting unobserved 
first-attempt quiz responses, outperforming two state-of-the- 
art baseline algorithms on all metrics and dataset partitions 
tested. We also find that our model enables the identification 
of key behavioral features (e.g., larger numbers of pauses 
and rewinds, and smaller numbers of fast forwards) that are 
correlated with higher learner engagement. 


Keywords 
Behavioral data, engagement, latent variable model, learning 
analytics, MOOC, performance prediction 


1. INTRODUCTION


The recent and rapid development of online learning plat- 
forms, coupled with advancements in machine learning, has 
created an opportunity to revamp the traditional “one-size- 
fits-all” approach to education. This opportunity is facilitated 
by the ability of many learning platforms, such as massive 
open online course (MOOC) platforms, to collect several 
different types of data on learners, including their assessment 
responses as well as their learning behavior [9]. The focus
of this work is on using different forms of data to model 
the learning process, which can lead to effective learning 
analytics and potentially improve learning efficacy. 


1.1 Behavior-based learning analytics 

Current approaches to learning analytics are focused mainly 
on providing feedback to learners about their knowledge 
states — or the level to which they have mastered given con- 
cepts/topics/knowledge components — through analysis of 
their responses to assessment questions [10, 24]. There are 
other cognitive (e.g., engagement [17, 31], confusion [37], and
emotion [11]) as well as non-cognitive (e.g., fatigue, motivation,
and level of financial support [14]) factors beyond
assessment performance that are crucial to the learning pro- 
cess as well. Accounting for them thus has the potential to 
yield more effective learning analytics and feedback. 


To date, it has been difficult to measure these factors of the 
learning process. Contemporary online learning platforms, 
however, have the capability to collect behavioral data that 
can provide some indicators of them. This data commonly 
includes learners’ usage patterns of different types of learning 
resources [12, 15], their interactions with others via social
learning networks [7, 28], their clickstream and keystroke ac- 
tivity logs [2, 8, 30], and sometimes other metadata including 
facial expressions [35] and gaze location [6]. 


Recent research has attempted to use behavioral data to 
augment learning analytics. [5] proposed a latent response 
model to classify whether a learner is gaming an intelligent 
tutoring system, for example. Several of these works have 
sought to demonstrate the relationship between behavior and 
performance of learners in different scenarios. In the context 
of MOOCs, [22] concluded that working on more assignments 
leads to better knowledge transfer than only watching videos,
[12] extracted probabilistic use cases of different types of 
learning resources and showed they are predictive of certifica- 
tion, [32] used discussion forum activity and topic analysis to 
predict test performance, and [26] discovered that submission 
activities can be used to predict final exam scores. In other 
educational domains, [2] discovered that learner keystroke 
activity in essay-writing sessions is indicative of essay qual- 
ity, [29] identified behavior as one of the factors predicting 
math test achievement, and [25] found that behavior is pre- 
dictive of whether learners can provide elegant solutions to 
mathematical questions. 


In this work, we are interested in how behavioral data can 
be used to model a learner’s engagement. 


1.2 Learner engagement 

Monitoring and fostering engagement is crucial to education, 
yet defining it concretely remains elusive. Research has 
sought to identify factors in online learning that may drive 
engagement; for example, [17] showed that certain production 
styles of lecture videos promote it. [20] defined disengagement 
as dropping out in the middle of a video and studied the 
relationship between disengagement and video content, while 
[31] considered the relationship between engagement and the 
semantic features of mathematical questions that learners
respond to. [33] studied the relationship between learners’ 
self-reported engagement levels in a learning session and their 
facial expressions immediately following in-session quizzes, 
and [34] considered how engagement is related to linguistic 
features of discussion forum posts. 


There are many types of engagement [3], with the type of 
interest depending on the specific learning scenario. Several 
approaches have been proposed for measuring and quan- 
tifying different types. These approaches can be roughly 
divided into two categories: device-based and activity-based. 
Device-based approaches measure learner engagement using 
devices external to the learning platform, such as cameras to 
record facial expressions [35], eye-tracking devices to detect 
mind wandering while reading text documents [6], and pupil 
dilation measurements, which are claimed to be highly corre- 
lated with engagement [16]. Activity-based approaches, on 
the other hand, measure engagement using heuristic features 
constructed from learners’ activity logs; prior work includes 
using replies/upvote counts and topic analysis of discussions 
[28], and manually defining different engagement levels based 
on activity types found in MOOCs [4, 21]. 


Both of these types have their drawbacks. Device-based 
approaches are far from universal in standard learning plat- 
forms because they require integration with external devices. 
They are also naturally invasive and carry potential privacy 
risks. Activity-based approaches, on the other hand, are 
not built on the same granularity of data, and tend to be 
defined from heuristics that have no guarantee of correlating 
with learning outcomes. It is therefore desirable to develop a 
statistically principled, activity-based approach to inferring 
a learner’s engagement. 


1.3 Our approach and contributions

In this paper, we propose a probabilistic model for inferring a 
learner’s engagement level by treating it as a latent variable 
that drives the learner’s performance and is in turn driven 
by the learner’s behavior. We apply our framework to a 
real-world MOOC dataset consisting of clickstream actions 
generated as learners watch lecture videos, and question 
responses from learners answering in-video quiz questions. 


We first formalize a method for quantifying a learner’s behav- 
ior while watching a video as a set of nine behavioral features 
that summarize the clickstream data generated (Section 2). 
These features are intuitive quantities such as the fraction 
of video played, the number of pauses made, and the aver- 
age playback rate, some of which have been associated with 
performance previously [8]. Then, we present our statistical 
model of learning (Section 3) as two main components: a 
learning model and a response model. The learning model
treats a learner’s gain in concept knowledge as proportional 
to their latent engagement level while watching a lecture 
video. Concept knowledge is treated as multidimensional, on 
a set of latent concepts underlying the course, and videos 
are associated with varying levels to different concepts. The 
response model treats a learner’s performance on in-video 
quiz questions, in turn, as proportional to their knowledge 
on the concepts that this particular question relates to. 


By defining engagement to correlate directly with performance,
we are able to learn which behavioral features lead to
high engagement through a single model. This differs from 
prior works that first define heuristic notions of engagement 
and subsequently correlate engagement with performance, 
in separate procedures. Moreover, our formulation of latent 
engagement can be made from entirely within standard learn- 
ing platforms, serving as a more universally applicable and 
less invasive alternative to device-based approaches. 


Finally, we evaluate two different aspects of our model (Sec- 
tion 4): its ability to predict unobserved, first-attempt quiz 
question responses, and its ability to provide meaningful 
analytics on engagement. We find that our model predicts 
with high quality, achieving AUCs of up to 0.76, and out- 
performing two state-of-the-art baselines on all metrics and 
dataset partitions tested. One of the partitions tested cor- 
responds to the beginning of the course, underscoring the 
ability of our model to provide early detection of struggling 
or advanced students. In terms of analytics, we find that 
our model enables us to identify behavioral features (e.g., 
large numbers of pauses and rewinds, and small numbers of 
fast forwards) that indicate high learner engagement, and to 
track learners’ engagement patterns throughout the course. 
More generally, these findings can enable an online learn- 
ing platform to detect learner disengagement and perform 
appropriate interventions in a fully automated manner. 


2. BEHAVIORAL DATA 


In this section, we start by detailing the setup of lecture 
videos and quizzes in MOOCs. We then specify video- 
watching clickstream data and our method for summarizing 
it into behavioral features. 


2.1 Course setup and data capture 

We are interested in modeling learner engagement while 
watching lecture videos to predict their performance on in- 
video quiz questions. For this purpose, we can view an 
instructor’s course delivery as the sequence of videos that 
learners will watch interspersed with the quiz questions they 
will answer. Let $Q = (q_1, q_2, \ldots)$ be the sequence of questions
asked through the course. A video could have any number
of questions generally, including none; to enforce a 1:1 correspondence
between video content and questions, we will
consider the “video” for question $q_n$ to be all video content
that appears between $q_{n-1}$ and $q_n$. Based on this, we will
explain the formats of video-watching and quiz response data 
we work with in this section. 


Our dataset. The dataset we will use is from the fall 2012 
offering of the course Networks: Friends, Money, and Bytes 
(FMB) on Coursera [1]. This course has 92 videos distributed 
among 20 lectures, and exactly one question per video. 


2.1.1 Video-watching clickstreams 

When a learner watches a video on a MOOC, their behavior 
is typically recorded as a sequence of clickstream actions. 
In particular, each time a learner makes an action (play,
pause, seek, ratechange, open, or close) on the video
player, a clickstream event is generated. Formally, the $i$th
event created for the course will be in the format

$$E_i = \langle u_i, v_i, e_i, p'_i, p_i, x_i, s_i, r_i \rangle.$$

Here, $u_i$ and $v_i$ are the IDs of the specific learner (user) and
video, respectively, and $e_i$ is the type of action that $u_i$ made
on $v_i$. $p_i$ is the position of the video player (in seconds)
immediately after $e_i$ is made, $p'_i$ is the position immediately
before,¹ $x_i$ is the UNIX timestamp (in seconds) at which $e_i$
was fired, $s_i$ is the binary state of the video player (either
playing or paused) once this action is made, and $r_i$ is the
playback rate of the video player once this action is made.
Our FMB dataset has 314,632 learner-generated clickstreams
from 3,976 learners.²


The set $E_{u,v} = \{E_i \mid u_i = u, v_i = v\}$ of clickstreams for learner
$u$ recorded on video $v$ can be used to reconstruct the behavior
$u$ exhibits on $v$. In Section 2.2 we will explain the features
computed from $E_{u,v}$ to summarize this behavior.
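
As a rough illustration of the event format described above, the following sketch stores one clickstream event in a Python dataclass and groups events by learner-video pair. The names `ClickEvent`, `group_by_pair`, and all field names are hypothetical, chosen for readability; they are not taken from the FMB dataset or from any platform API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ClickEvent:
    """One clickstream event; field names are illustrative only."""
    u: int         # learner ID
    v: int         # video ID
    e: str         # action type, e.g. "play", "pause", "seek"
    p_prev: float  # player position (s) immediately before the action
    p: float       # player position (s) immediately after the action
    x: float       # UNIX timestamp (s) at which the event fired
    s: str         # player state after the action: "playing" or "paused"
    r: float       # playback rate after the action


def group_by_pair(events):
    """Return {(u, v): [events sorted by timestamp]} for all learner-video pairs."""
    groups = defaultdict(list)
    for ev in events:
        groups[(ev.u, ev.v)].append(ev)
    for evs in groups.values():
        evs.sort(key=lambda ev: ev.x)
    return dict(groups)
```

Sorting each group by timestamp is a design choice made here so that consecutive events can later be paired into intervals.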


2.1.2 Quiz responses 
When a learner submits a response to an in-video quiz ques- 
tion, an event is generated in the format 


$$A_m = \langle u_m, v_m, x_m, a_m, y_m \rangle.$$


Again, $u_m$ and $v_m$ are the learner and video IDs (i.e., the
quiz corresponding to the video). $x_m$ is the UNIX timestamp
of the submission, $a_m$ is the specific response, and $y_m$ is the
number of points awarded for the response. The questions
in our dataset are multiple choice with a single response, so
$y_m$ is binary-valued.


In this work, we are interested in whether quiz responses 
were correct on first attempt (CFA) or not. As a result, 
with $A_{u,v} = \{A_m \mid u_m = u, v_m = v\}$, we consider the event
$A'_{u,v}$ in this set with the earliest timestamp $x'_{u,v}$. We also
only consider the set of clickstreams $E'_{u,v} \subseteq E_{u,v}$ that occur
before $x'_{u,v}$, as the ones after would be anti-causal to CFA.
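
The selection of the first attempt and the causal filtering of clickstreams can be sketched as follows. This is a minimal illustration under assumed input shapes (plain tuples and lists); the function names `first_attempt` and `causal_clicks` are hypothetical.

```python
def first_attempt(responses):
    """responses: list of (timestamp, points) for one learner-video pair.

    Returns (x_first, cfa), where x_first is the earliest submission time
    and cfa is 1 if that earliest submission earned points, else 0.
    """
    x_first, y_first = min(responses, key=lambda resp: resp[0])
    return x_first, int(y_first > 0)


def causal_clicks(click_timestamps, x_first):
    """Keep only clickstream timestamps strictly before the first submission."""
    return [x for x in click_timestamps if x < x_first]
```

For example, with submissions at times 200 (correct) and 150 (incorrect), the CFA score is 0 and only clicks before time 150 are retained.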


2.2 Behavioral features and CFA score 

With the data $E'_{u,v}$ and $A'_{u,v}$, we construct two sets of information
for each learner $u$ on each video $v$, i.e., each
learner-video pair. First is a set of nine behavioral features
that summarize $u$’s video-watching behavior on $v$ [8]:


(1) Fraction spent. The fraction of time the learner spent 
on the video, relative to the playback length of the video. 
Formally, this quantity is $e_{u,v}/l_v$, where

$$e_{u,v} = \sum_{i \in S} \min(x_{i+1} - x_i, l_v)$$

is the elapsed time on $v$ obtained by finding the total UNIX
time for $u$ on $v$, and $l_v$ is the length of the video (in seconds).
Here, $S = \{i \in E'_{u,v} : e_{i+1} \neq \text{open}\}$. $l_v$ is included as an
upper bound for excessively long intervals of time.


(2) Fraction completed. The fraction of the video that the 
learner completed, between 0 (none) and 1 (all). Formally, 
it is $c_{u,v}/l_v$, where $c_{u,v}$ is the number of unique 1-second
segments of the video that the learner visited.


¹ $p_i$ and $p'_i$ will only differ when $i$ is a skip event.

² This number excludes invalid stall, null, and error events,
as well as open and close events, which are generated automatically.
Figure 1: Distribution of the number of videos that
each learner completed in FMB. More than
85% of learners completed fewer than 20 videos.


(3) Fraction played. The fraction of the video that the 
learner played relative to the length. Formally, it is calculated
as $g_{u,v}/l_v$, where

$$g_{u,v} = \sum_{i \in S} \min(p'_{i+1} - p_i, l_v)$$

is the total length of video that was played (while in the
playing state). Here, $S = \{i \in E'_{u,v} : e_{i+1} \neq \text{open} \wedge s_i = \text{playing}\}$.


(4) Fraction paused. The fraction of time the learner 
stayed paused on the video relative to the length. It is 
calculated as $h_{u,v}/l_v$, where

$$h_{u,v} = \sum_{i \in S} \min(x_{i+1} - x_i, l_v)$$

is the total time the learner stayed in the paused state on this
video. Here, $S = \{i \in E'_{u,v} : e_{i+1} \neq \text{open} \wedge s_i = \text{paused}\}$.


(5) Number of pauses. The number of times the learner 
paused the video, or 


$$\sum_{i \in E'_{u,v}} \mathbf{1}\{e_i = \text{pause}\},$$

where $\mathbf{1}\{\cdot\}$ is the indicator function.


(6) Number of rewinds. The number of times the learner 
skipped backwards in the video, or 


$$\sum_{i \in E'_{u,v}} \mathbf{1}\{e_i = \text{skip} \wedge p_i < p'_i\}.$$


(7) Number of fast forwards. The number of times the
learner skipped forward in the video, i.e., with $p_i > p'_i$ in the
previous equation.


(8) Average playback rate. The time-average of the 
learner’s playback rate on the video. Formally, it is calculated
as

$$\bar{r}_{u,v} = \frac{\sum_{i \in S} r_i \cdot \min(x_{i+1} - x_i, l_v)}{\sum_{i \in S} \min(x_{i+1} - x_i, l_v)},$$

where $S = \{i \in E'_{u,v} : e_{i+1} \neq \text{open} \wedge s_i = \text{playing}\}$.
(9) Standard deviation of playback rate. The standard
deviation of the learner’s playback rate. It is calculated as

$$\sqrt{\frac{\sum_{i \in S} (r_i - \bar{r}_{u,v})^2 \cdot \min(x_{i+1} - x_i, l_v)}{\sum_{i \in S} \min(x_{i+1} - x_i, l_v)}},$$

where $\bar{r}_{u,v}$ is the average playback rate, with the same $S$ as the average playback rate.
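
To make the feature definitions more concrete, the sketch below computes several of them from a time-ordered list of events for one learner-video pair. It is an illustration under stated assumptions, not the authors' implementation: events are plain dictionaries with hypothetical keys (`e`, `p_prev`, `p`, `x`, `s`, `r`), a rewind is taken to be a skip whose position after the action is earlier than the position before it, and the standard deviation is assumed to use the same duration weights as the average playback rate.

```python
import math


def behavioral_features(events, video_length):
    """Compute a subset of the behavioral features for one learner-video pair.

    events: time-ordered list of dicts with keys
      'e' (action), 'p_prev'/'p' (position before/after), 'x' (timestamp),
      's' (state after the action), 'r' (playback rate after the action).
    """
    spent = 0.0
    paused = 0.0
    playing = []  # (rate, capped duration) for intervals spent playing
    for cur, nxt in zip(events, events[1:]):
        if nxt["e"] == "open":  # interval closed by a new session: skip it
            continue
        dt = min(nxt["x"] - cur["x"], video_length)  # cap long intervals
        spent += dt
        if cur["s"] == "paused":
            paused += dt
        elif cur["s"] == "playing":
            playing.append((cur["r"], dt))

    total = sum(dt for _, dt in playing)
    mean_rate = sum(r * dt for r, dt in playing) / total if total else 0.0
    var_rate = (
        sum(dt * (r - mean_rate) ** 2 for r, dt in playing) / total
        if total else 0.0
    )
    skips = [ev for ev in events if ev["e"] == "skip"]
    return {
        "fraction_spent": spent / video_length,
        "fraction_paused": paused / video_length,
        "num_pauses": sum(ev["e"] == "pause" for ev in events),
        "num_rewinds": sum(ev["p"] < ev["p_prev"] for ev in skips),
        "num_fast_forwards": sum(ev["p"] > ev["p_prev"] for ev in skips),
        "avg_playback_rate": mean_rate,
        "std_playback_rate": math.sqrt(var_rate),
    }
```

Fraction completed and fraction played are omitted here because they additionally require tracking which positions of the video were visited.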


The second piece of information for each learner-video pair 
is $u$’s CFA score $y_{u,v} \in \{0,1\}$ on the quiz question for $v$.


2.3 Dataset subsets 


We will consider different groups of learner-video pairs when 
evaluating our model in Section 4. Our motivation for doing 
so is the heterogeneity of learner motivation and high dropoff 
rates in MOOCs [9]: many will quit the course after watching 
just a few lectures. Modeling in a small subset of data, 
particularly those at the beginning of the course, is desirable 
because it can lead to “early detection” of those who may 
drop out [8]. 


Figure 1 shows the dropoff for our dataset in terms of the 
number of videos each learner completed: more than 85% 
of learners completed just 20% of the course. “Completed” 
is defined here as having watched some of the video and 
responded to the corresponding question. Letting $T_u$ be the
number of videos learner $u$ completed and $\gamma(v)$ be the index
of video $v$ in the course, we define
$\Omega^{u_0,v_0} = \{(u,v) : T_u \geq u_0 \wedge \gamma(v) \leq v_0\}$
to be the subset of learner-video pairs
such that $u$ completed at least $u_0$ videos and $v$ is within the
first $v_0$ videos. The full dataset is $\Omega^{1,92}$, and we will also
consider $\Omega^{20,92}$ as the subset of 346 active learners over the
full course and $\Omega^{1,20}$ as the subset of all learners over the
first two weeks³ in our evaluation.
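
The subset definition can be read as a simple filter over learner-video pairs. The sketch below assumes two lookup tables, `videos_completed` (videos completed per learner) and `video_index` (1-based position of each video in the course); these names and the function `select_subset` are hypothetical.

```python
def select_subset(pairs, videos_completed, video_index, u0, v0):
    """Return the learner-video pairs (u, v) such that learner u completed
    at least u0 videos and video v is among the first v0 videos."""
    return [
        (u, v)
        for u, v in pairs
        if videos_completed[u] >= u0 and video_index[v] <= v0
    ]
```

Under this reading, `u0 = 1, v0 = 92` keeps the full dataset, `u0 = 20, v0 = 92` keeps only active learners, and `u0 = 1, v0 = 20` keeps all learners on the first 20 videos.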


3. STATISTICAL MODEL OF LEARNING 
WITH LATENT ENGAGEMENT 


In this section, we propose our statistical model. Let U 
denote the number of learners (indexed by u) and V the 
number of videos (indexed by v). Further, we use $T_u$ to
denote the number of time instances registered by learner 
u (indexed by t); we take a time instance to be a learner 
completing a video, i.e., watching a video and answering the 
corresponding quiz question. For simplicity, we use a discrete 
notion of time, i.e., each learner-video pair will correspond 
to one time instance for one learner. 


Our model considers learners’ responses to quiz questions 
as measurements of their underlying knowledge on a set of 
concepts; let $K$ denote the number of such concepts. Further,
our model considers the action of watching lecture videos 
as part of learning that changes learners’ latent knowledge 
states over time. These different aspects of the model are 
visualized in Figure 2: there are two main components, a 
response model and a learning model. 


3.1 Response Model 


Our statistical model of learner responses is given by 


$$p(y_u^{(t)} = 1 \mid \mathbf{c}_u^{(t)}) = \sigma(\mathbf{w}_{v(u,t)}^T \mathbf{c}_u^{(t)} - \mu_{v(u,t)} + a_u), \tag{1}$$


³ In FMB, the first two weeks of lectures are the first 20 videos.


Figure 2: Our proposed statistical model of learning
consists of two main parts, a response model and a
learning model.


where $v(u,t) : \Omega \subseteq \{1,\ldots,U\} \times \{1,\ldots,\max_u T_u\} \to \{1,\ldots,V\}$
denotes a mapping from a learner index-time
index pair to the index of the video $v$ that $u$ was watching at
$t$. $y_u^{(t)} \in \{0,1\}$ is the binary-valued CFA score of learner $u$
on the quiz question corresponding to the video they watch
at time $t$, with 1 denoting a correct response (CFA) and 0
denoting an incorrect response (non-CFA).


The variable $\mathbf{w}_v \in \mathbb{R}_+^K$ denotes the non-negative, $K$-dimensional
quiz question-concept association vector that
characterizes how the quiz question corresponding to video $v$
tests learners’ knowledge on each concept, and the variable
$\mu_v$ is a scalar characterizing the intrinsic difficulty of the quiz
question. $\mathbf{c}_u^{(t)}$ is the $K$-dimensional concept knowledge vector
of learner $u$ at time $t$, characterizing the knowledge level of
the learner on each concept at the time, and $a_u$ denotes the
static, intrinsic ability of learner $u$. Finally, $\sigma(x) = \frac{1}{1+e^{-x}}$ is
the sigmoid function.


We restrict the question-concept association vector $\mathbf{w}_v$ to be
non-negative in order to make the parameters interpretable
[24]. Under this restriction, the values of concept knowledge
vector $\mathbf{c}_u^{(t)}$ can be understood as follows: large, positive values
lead to higher chances of answering a question correctly, thus 
corresponding to high knowledge, while small, negative values 
lead to lower chances of answering a question correctly, thus 
corresponding to low knowledge. 
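
As a small numerical illustration of the response model, the sketch below evaluates the probability of a correct first attempt as a sigmoid of the question-concept association weighted knowledge, minus the question difficulty, plus the learner ability. The function names and the example values are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def p_correct(w, c, mu, a):
    """Probability of a correct first attempt.

    w: non-negative question-concept association vector (length K)
    c: learner concept knowledge vector (length K)
    mu: intrinsic question difficulty (scalar)
    a: intrinsic learner ability (scalar)
    """
    score = sum(w_k * c_k for w_k, c_k in zip(w, c)) - mu + a
    return sigmoid(score)
```

Because the association weights are non-negative, raising the knowledge on any concept the question tests can only raise this probability, which is the interpretability property described above.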


3.2 Learning Model 


Our model of learning considers transitions in learners’ knowledge
states as induced by watching lecture videos. It is given
by

$$\mathbf{c}_u^{(t)} = \mathbf{c}_u^{(t-1)} + e_u^{(t)} \mathbf{d}_{v(u,t)}, \quad t = 1, \ldots, T_u, \tag{2}$$


where the variable $\mathbf{d}_v \in \mathbb{R}_+^K$ denotes the non-negative, $K$-dimensional
learning gain vector for video $v$; each entry
characterizes the degree to which the video improves learners’
knowledge level on each concept. The assumption of non-negativity
on $\mathbf{d}_v$ implies that videos will not negatively affect
learners’ knowledge, as in [23]. $\mathbf{c}_u^{(0)}$ is the initial knowledge
state of learner $u$ at time $t = 0$, i.e., before starting the
course and watching any video.


Algorithm        $\Omega^{20,92}$                     $\Omega^{1,20}$                      $\Omega^{1,92}$
                 ACC              AUC               ACC              AUC               ACC              AUC
Proposed model   0.7293 ± 0.0070  0.7608 ± 0.0094   0.7096 ± 0.0057  0.7045 ± 0.0066   0.7058 ± 0.0054  0.7216 ± 0.0054
SPARFA           0.7209 ± 0.0070  0.7532 ± 0.0098   0.7061 ± 0.0069  0.7020 ± 0.0070   0.6975 ± 0.0048  0.7124 ± 0.0050
BKT              0.7038 ± 0.0084  0.7218 ± 0.0126   0.6825 ± 0.0058  0.6662 ± 0.0065   0.6803 ± 0.0055  0.6830 ± 0.0059

Table 1: Quality comparison of the different algorithms on predicting unobserved quiz question responses.
The obtained ACC and AUC metrics on different subsets of the FMB dataset are given. Our proposed model
obtains higher quality than the SPARFA and BKT baselines in each case.


The scalar latent variable $e_u^{(t)} \in [0,1]$ in (2) characterizes
the engagement level that learner $u$ exhibits when watching
video $v(u,t)$ at time $t$. This is in turn modeled as

$$e_u^{(t)} = \sigma(\boldsymbol{\beta}^T \mathbf{f}_u^{(t)}), \tag{3}$$

where $\mathbf{f}_u^{(t)}$ is a 9-dimensional vector of the behavioral features
defined in Section 2.2, summarizing learner $u$’s behavior while
watching the video at time $t$. $\boldsymbol{\beta}$ is the unknown, 9-dimensional
parameter vector that characterizes how engagement associates
with each behavioral feature.


Taken together, (2) and (3) state that the knowledge gain a 
learner will experience on a particular concept while watching 
a particular video is given by 


(i) the video’s intrinsic association with the concept, mod- 
ulated by 


(ii) the learner’s engagement while watching the video, as 
manifested by their clickstream behavior. 
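
The two ingredients listed above can be sketched in a few lines: engagement is a sigmoid of a weighted sum of behavioral features, and the knowledge state is advanced by the video's learning gain vector scaled by that engagement. The function names and the toy numbers in the usage note are hypothetical.

```python
import math


def engagement(beta, f):
    """Engagement level: sigmoid of the inner product of beta and f."""
    z = sum(b * x for b, x in zip(beta, f))
    return 1.0 / (1.0 + math.exp(-z))


def update_knowledge(c_prev, d, e):
    """One learning step: previous knowledge plus engagement times
    the video's (non-negative) learning gain, per concept."""
    return [c_k + e * d_k for c_k, d_k in zip(c_prev, d)]
```

For instance, with an engagement of 0.5, a video whose gain vector is `[2.0, 0.0]` moves a knowledge state of `[0.0, 1.0]` to `[1.0, 1.0]`: the learner acquires half of the available gain on the first concept and nothing on the second.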


From (2), a learner’s (latent) engagement level dictates the 
fraction of the video’s available learning gain they acquire 
to improve their knowledge on each concept. The response
model (1) in turn holds that performance is dictated by a
learner’s concept knowledge states. In this way, engagement
is directly correlated with performance through the concept
knowledge states. Note that in this paper, we treat the engagement
variable $e_u^{(t)}$ as a scalar; the extension of modeling
it as a vector and thus separating engagement by concept is 
part of our ongoing work. 


It is worth mentioning the similarity between our character- 
ization of engagement as a latent variable in the learning 
model and the input gate variables in long short-term memory
(LSTM) neural networks [18]. In LSTM, the change
in the latent memory state (loosely corresponding to the
latent concept knowledge state vector $\mathbf{c}_u^{(t)}$) is given by the
input vector (loosely corresponding to the video learning
gain vector $\mathbf{d}_v$) modulated by a set of input gate variables
(corresponding to the engagement variable $e_u^{(t)}$).


Parameter inference. Our statistical model of learning
and response can be seen as a particular type of recurrent neural
network (RNN). Therefore, for parameter inference, we
implement a stochastic gradient descent algorithm with standard
backpropagation. Given the graded learner responses
$y_u^{(t)}$ and behavioral features $\mathbf{f}_u^{(t)}$, our parameter inference
algorithm estimates the quiz question-concept association
vectors $\mathbf{w}_v$, the quiz question intrinsic difficulties $\mu_v$, the
video learning gain vectors $\mathbf{d}_v$, the learner initial knowledge
vectors $\mathbf{c}_u^{(0)}$, the learner abilities $a_u$, and the engagement-behavioral
feature association vector $\boldsymbol{\beta}$. We omit the details
of the algorithm for simplicity of exposition.


4. EXPERIMENTS 

In this section, we evaluate the proposed latent engagement 
model on the FMB dataset. We first demonstrate the gain 
in predictive quality of the proposed model over two baseline 
algorithms (Section 4.1), and then show how our model can 
be used to study engagement (Section 4.2). 


4.1 Predicting unobserved responses 
We evaluate our proposed model’s quality by testing its 
ability to predict unobserved quiz question responses. 


Baselines. We compare our model against two well-known, 
state-of-the-art response prediction algorithms that do not 
use behavioral data. First is the sparse factor analysis 
(SPARFA) algorithm [24], which factors the learner-question 
matrix to extract latent concept knowledge, but does not use 
a time-varying model of learners’ knowledge states. Second is 
a version of the Bayesian knowledge tracing (BKT) algorithm 
that tracks learners’ time-varying knowledge states, which 
incorporates a set of guessing and slipping probability pa- 
rameters for each question, a learning probability parameter 
for each video, and an initial knowledge level parameter for 
each learner [13, 27]. 


4.1.1 Experimental setup and metrics 
Regularization. In order to prevent overfitting, we add 
$\ell_2$-norm regularization terms to the overall optimization
objective function for every set of variables in both the
proposed model and in SPARFA. We use a parameter $\lambda$ to
control the amount of regularization on each variable.


Cross validation. We perform 5-fold cross validation on 
the full dataset ($\Omega^{1,92}$), and on each subset of the dataset
introduced in Section 2.3 ($\Omega^{20,92}$ and $\Omega^{1,20}$). To do so, we
randomly partition each learner’s quiz question responses 
into 5 data folds. Leaving out one fold as the test set, we use 
the remaining four folds as training and validation sets to 
select the values of the tuning parameters for each algorithm, 
i.e., by training on three of the folds and validating on the 
other. We then train every algorithm on all four observed 
folds using the tuned values of the parameters, and evaluate 
them on the holdout set. All experiments are repeated for 
20 random partitions of the training and test sets. 
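
The per-learner partitioning step can be sketched as follows. This is one possible reading of "randomly partition each learner's quiz question responses into 5 data folds"; the function name, the input shape, and the round-robin assignment with a random starting fold are assumptions made for illustration.

```python
import random


def partition_responses(responses_by_learner, n_folds=5, seed=0):
    """Split each learner's responses across n_folds folds.

    responses_by_learner: {learner_id: [video_id, ...]}
    Returns a list of n_folds lists of (learner_id, video_id) pairs.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for u, videos in responses_by_learner.items():
        shuffled = list(videos)
        rng.shuffle(shuffled)
        # Random starting fold so that learners with few responses
        # do not always land in the first folds.
        offset = rng.randrange(n_folds)
        for k, v in enumerate(shuffled):
            folds[(k + offset) % n_folds].append((u, v))
    return folds
```

Each fold can then serve once as the held-out test set, with the remaining four used for training and validation; repeating with different seeds gives the random repetitions described above.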


For the proposed model and for SPARFA, we tune both the
number of concepts $K \in \{2,4,6,8,10\}$ and the regularization
parameter $\lambda \in \{0.5,1.0,\ldots,10.0\}$. Note that for the
proposed model, when a question response is left out as part
of the test set, only the response is left out of the training
set: the algorithm still uses the clickstream data for the
corresponding learner-video pair to model engagement.


Feature                               Coefficient
Fraction spent                         0.1941
Fraction completed                     0.1443
Fraction played                        0.2024
Fraction paused                        0.0955
Number of pauses                       0.2233
Number of rewinds                      0.4338
Number of fast forwards               -0.1551
Average playback rate                  0.2797
Standard deviation of playback rate    0.0314

Table 2: Regression coefficient vector $\boldsymbol{\beta}$ learned over
the full dataset, associating each clickstream feature
to engagement. All but one of the features (number
of fast forwards) are positively correlated with engagement.


Metrics. To evaluate the quality of the algorithms, we
employ two commonly used binary classification metrics: 
prediction accuracy (ACC) and area under the receiver oper- 
ating characteristic curve (AUC) [19]. The ACC metric is 
simply the fraction of predictions that are made correctly, 
while the AUC measures the tradeoff between the true and 
false positive rates of the classifier. Both metrics take values 
in [0,1], with larger values indicating higher quality. 
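
Both metrics can be computed directly from predicted probabilities and binary labels, as in the sketch below. The 0.5 decision threshold for ACC is an assumption (the text does not state one), and AUC is computed here through its pairwise-ranking interpretation: the fraction of positive-negative pairs in which the positive example receives the higher score, with ties counted as one half.

```python
def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, p_pred))
    return hits / len(y_true)


def auc(y_true, p_pred):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(
        1.0 if pp > pn else 0.5 if pp == pn else 0.0
        for pp in pos
        for pn in neg
    )
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of examples and is meant only to show what the metric measures; a rank-based implementation would be preferable on large datasets.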


4.1.2 Results and discussion 

Table 1 gives the evaluation results for the three algorithms. 
The average and standard deviation over the 20 random data 
partitions are reported for each dataset group and metric. 


First of all, the results show that our proposed model consis- 
tently achieves higher quality than both baseline algorithms 
on both metrics. It significantly outperforms BKT in par- 
ticular (SPARFA also outperforms BKT). This shows the 
potential of our model to push the envelope on achievable 
quality in performance prediction research. 


Notice that our model achieves its biggest quality improvement
on the full dataset, with a 1.3% gain in AUC over
SPARFA and a 5.7% gain over BKT. This observation suggests
that as more clickstream data is captured and available
for modeling, especially as we observe more video-watching
behavioral data from learners over a longer period of time
(the full dataset $\Omega^{1,92}$ contains clickstream data for up to
12 weeks, while the $\Omega^{1,20}$ subset only contains data for the
first 2 weeks), the proposed model achieves more significant
quality enhancements over the baseline algorithms. This
is somewhat surprising, since prior work on behavior-based
performance prediction [8] has found the largest gains in the
presence of fewer learner-video pairs, i.e., before there are
many question responses for other algorithms to model on.
But our algorithm also benefits from additional question responses,
to update its learned relationship between behavior
and concept knowledge.


Figure 3: Plot of the latent engagement level $e_u^{(t)}$
over time for one third of the learners in FMB, showing
a diverse set of behaviors across learners.


The first two weeks of data ($\Omega^{1,20}$) is sparse in that the
majority of learners answer at most a few questions during
this time, many of whom will drop out (see Figure 1). In
this case, our model obtains a modest improvement over
SPARFA, which is static and uses fewer parameters. The
gain over BKT is particularly pronounced, at 5.7%. This,
combined with the findings for active learners over the full
course ($\Omega^{20,92}$), shows that observing video-watching behavior
of learners who drop out of the course in its early stages
(these learners are excluded from $\Omega^{20,92}$) leads to a slight
increase in the performance gain of the proposed model over
the baseline algorithms. Importantly, this shows that our 
algorithm provides benefit for early detection, with the ability 
to predict performance of learners who will end up dropping 
out [8]. 


4.2 Analyzing engagement 

Given predictive quality, one benefit of our model is that it 
can be used to analyze engagement. The two parameters to
consider for this are the regression coefficient vector and
the engagement scalar itself.


Behavior and engagement. Table 2 gives each of the
estimated feature coefficients for the full dataset,
with regularization parameters chosen via cross validation. 
All of the features except for the number of fast forwards are 
positively correlated with the latent engagement level. This 
is to be expected since many of the features are associated 
with processing more video content, e.g., spending more 
time, playing more, or pausing longer to reflect, while fast 
forwarding involves skipping over the content. 


The features that contribute most to high latent engagement 
levels are the number of pauses, the number of rewinds, and 
the average playback rate. The first two of these are likely 
indicators of actual engagement as well, since they indicate 
whether the learner was thinking while pausing the video 
or re-visiting earlier content which contains knowledge that 
they need to recall or revise. The strong, positive correlation 
of average playback rate is somewhat surprising though: 
we may expect that a higher playback rate would have a
negative impact on engagement, like fast forwarding does, as
it involves speeding through content. On the other hand, it
may be an indication that learners are more focused on the
material and trying to keep their interest higher.


Figure 4: Plot of the latent engagement level e over time for selected learners in three different groups.


Engagement over time. Figure 3 visualizes the evolution
of e over time for 1/3 of the learners (randomly selected).
Patterns in engagement differ substantially across learners;
those who finish the course mostly exhibit high engagement 
levels throughout, while those who drop out early vary greatly 
in their engagement, some high and others low. 


Figure 4 breaks down the learners into three different types 
according to their engagement patterns, and plots their en- 
gagement levels over time separately. The first type of learner 
(a) finishes the course and consistently exhibits high engage- 
ment levels throughout the duration. The second type (b) 
also consistently exhibits high engagement levels, but drops 
out of the course after up to three weeks. The third type of 
learner (c) exhibits inconsistent engagement levels before an 
early dropout. Equipped with temporal plots like these, an 
instructor could determine which learners may be in need 
of intervention, and could design different interventions for 
different engagement clusters [8, 36]. 


5. CONCLUSIONS AND FUTURE WORK 


In this paper, we proposed a new statistical model for learn- 
ing, based on learner behavior while watching lecture videos 
and their performance on in-video quiz questions. Our model 
has two main parts: (i) a response model, which relates a 
learner’s performance to latent concept knowledge, and (ii) 
a learning model, which relates the learner’s concept knowl- 
edge in turn to their latent engagement level while watching 
videos. Through evaluation on a real-world MOOC dataset, 
we showed that our model can predict unobserved question 
responses with superior quality to two state-of-the-art base- 
lines, and also that it can lead to engagement analytics: it 
identifies key behavioral features driving high engagement, 
and shows how each learner’s engagement evolves over time. 


Our proposed model enables the measurement of engagement 
solely from data that is logged within online learning plat- 
forms: clickstream data and quiz responses. In this way, it 
serves as a less invasive alternative to current approaches 
for measuring engagement that require external devices, e.g., 
cameras and eye-trackers [6, 16, 35]. One avenue of future 
work is to conduct an experiment that will correlate our 
definition of latent engagement with these methods. 


Additionally, one could test other, more sophisticated char- 
acterizations of the latent engagement variable. One such 
approach could seek to characterize engagement as a func- 
tion of learners’ previous knowledge level. An alternative or 
addition to this would be a generative modeling approach of 
engagement to enable the prediction of future engagement 
given each learner’s learning history. 


One of the long-term goals of this work is the design
of a method that provides useful, real-time analytics to instructors. The
true test of this ability comes from incorporating the method 
into a learning system, providing its outputs — namely, per- 
formance prediction forecasts and engagement evolution — to 
an instructor through the user interface, and measuring the 
resulting improvement in learning outcomes.


ABSTRACT

Gathering labeled data in educational data mining (EDM) 
is a time- and cost-intensive task. However, the amount
of available training data directly influences the quality of 
predictive models. Unlabeled data, on the other hand, is 
readily available in high volumes from intelligent tutoring 
systems and massive open online courses. In this paper, we 
present a semi-supervised classification pipeline that makes 
effective use of this unlabeled data to significantly improve 
model quality. We employ deep variational auto-encoders 
to learn efficient feature embeddings that improve the per- 
formance for standard classifiers by up to 28% compared 
to completely supervised training. Further, we demonstrate 
on two independent data sets that our method outperforms 
previous methods for finding efficient feature embeddings 
and generalizes better to imbalanced data sets compared 
to expert features. Our method is data independent and 
classifier-agnostic, and hence provides the ability to improve 
performance on a variety of classification tasks in EDM. 


Keywords 
semi-supervised classification, variational auto-encoder, deep 
neural networks, dimensionality reduction 


1. INTRODUCTION 


Building predictive models of student characteristics such 
as knowledge level, learning disabilities, personality traits 
or engagement is one of the big challenges in educational 
data mining (EDM). Such detailed student profiles allow
for a better adaptation of the curriculum to individual
needs and are crucial for fostering optimal learning progress.
In order to build such predictive models, smaller-scale and
controlled user studies are typically conducted where detailed
information about student characteristics is at hand
(labeled data). The quality of the predictive models, how- 
ever, inherently depends on the number of study partici- 
pants, which is typically on the lower side due to time and 
budget constraints. In contrast to such controlled user stud- 
ies, digital learning environments such as intelligent tutoring 
systems (ITS), educational games, learning simulations, and 
massive open online courses (MOOCs) produce high volumes 
of data. These data sets provide rich information about stu- 
dent interactions with the system, but come with no or only 
little additional information about the user (unlabeled data).
Semi-supervised learning bridges this gap by making use of
patterns in bigger unlabeled data sets to improve predictions 
on smaller labeled data sets. This is also the focus of this 
paper. These techniques are well explored in a variety of 
domains and it has been shown that classifier performance 
can be improved for, e.g., image classification [15], natu- 
ral language processing [28] or acoustic modeling [21]. In 
the education community, semi-supervised classification has 
been used employing self-training, multi-view training and 
problem-specific algorithms. Self-training has, for example, been
applied to problem-solving performance [22]. In self-training,
a classifier is first trained on labeled data and is then iteratively
retrained using its most confident predictions on unlabeled
data. Self-training has the disadvantage that incorrect
predictions decrease the quality of the classifier. Multi-view
training uses different data views and has been explored
with co-training [27] and tri-training [18] for predicting pre- 
requisite rules and student performance, respectively. The 
performance of these methods, however, largely depends on 
the properties of the different data views, which are not yet 
fully understood [34]. Problem-specific semi-supervised algorithms
have been used to organize learning resources on
the web [19], with the disadvantage that they cannot be
directly applied to other classification tasks.
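The self-training loop described above can be sketched in a few lines. The example below is a minimal illustration only: the one-dimensional nearest-centroid classifier, the margin-based confidence score, and the names `centroid_fit`, `centroid_predict` and `self_train` are assumptions made for this sketch and are not taken from the cited work.

```python
def centroid_fit(xs, ys):
    """Fit a 1-D nearest-centroid classifier: one mean per class label."""
    cents = {}
    for label in set(ys):
        vals = [x for x, y in zip(xs, ys) if y == label]
        cents[label] = sum(vals) / len(vals)
    return cents


def centroid_predict(cents, x):
    """Return (label, confidence); confidence is the distance margin
    between the nearest and the second-nearest centroid."""
    dists = sorted((abs(x - c), label) for label, c in cents.items())
    return dists[0][1], dists[1][0] - dists[0][0]


def self_train(xs, ys, unlabeled, rounds=3, per_round=2):
    """Iteratively adopt the most confident pseudo-labels and refit."""
    xs, ys, pool = list(xs), list(ys), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        cents = centroid_fit(xs, ys)
        # Rank the unlabeled pool by confidence, most confident first.
        ranked = sorted(pool, key=lambda x: centroid_predict(cents, x)[1],
                        reverse=True)
        for x in ranked[:per_round]:
            xs.append(x)
            ys.append(centroid_predict(cents, x)[0])
            pool.remove(x)
    return centroid_fit(xs, ys)


labeled_x, labeled_y = [0.0, 10.0], [0, 1]
unlabeled = [1.0, 2.0, 8.0, 9.0, 5.5]
model = self_train(labeled_x, labeled_y, unlabeled)
```

In this toy run the ambiguous point 5.5 is eventually pseudo-labeled as class 1 and pulls that centroid towards the class boundary, which illustrates how an uncertain or incorrect pseudo-label can degrade the classifier.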


Recently, it has been shown (outside of the education context)
that variational auto-encoders (VAE) have the potential
to outperform the commonly used semi-supervised classification
techniques. A VAE is a neural network that includes
an encoder that transforms a given input into a typically 
lower-dimensional representation, and a decoder that recon- 
structs the input based on the latent representation. Hence, 
VAEs learn an efficient feature embedding (feature repre- 
sentation) using unlabeled data that can be used to im- 
prove the performance of any standard supervised learning 
algorithm [15]. This property greatly reduces the need for 
problem-specific algorithms. Moreover, VAEs feature the 
advantage that the trained deep generative models are able 
to produce realistic samples that allow for accurate data 
imputation and simulations [23], which makes them an appealing
choice for EDM. Inspired by these advantages, and
the demonstrated superior classifier performance in other
domains such as computer vision [16, 23], this paper explores
VAE for student classification in the educational context.


We present a complete semi-supervised classification pipeline
that employs deep VAEs to extract efficient feature embeddings
from unlabeled student data. We have optimized the
architecture of two different networks for educational data:
a simple variational auto-encoder and a convolutional variational
auto-encoder. While our method is generic and hence
widely applicable, we apply the pipeline to the problem of
detecting students suffering from developmental dyscalculia
(DD), which is a learning disability in arithmetic. The large
and unlabeled data set at hand consists of student data of 
more than 7K students and we evaluate the performance of 
our pipeline on two independent small and labeled data sets 
with 83 and 155 students. Our evaluation first compares the 
performance of the two networks, where our results indicate 
superiority of the convolutional VAE. We then apply dif- 
ferent classifiers to both labeled data sets, and demonstrate 
not only improvements in classification performance of up to 
28% compared to other feature extraction algorithms, but 
also improved robustness to class imbalance when using our 
pipeline compared to other feature embeddings. The improved
robustness of our VAE is especially important for
predicting relatively rare student conditions, a challenge
that is often met in EDM applications.


2. BACKGROUND 


In the semi-supervised classification setting we have access
to a large data set $\mathcal{X}_B$ without labels and a much smaller
labeled data set $\mathcal{X}_S$ with labels $\mathcal{Y}_S$. The idea behind
semi-supervised classification is to make use of patterns in the
unlabeled data set to improve the quality of the classifier
beyond what would be possible with the small data set
$\mathcal{X}_S$ alone. There are many different approaches to
semi-supervised classification including transductive SVMs, graph-based
methods, self-training or representation learning [35].
In this work we focus on learning an efficient encoding $z = E(x)$
for $x \in \mathcal{X}_B$ of the data domain using the unlabeled
data $\mathcal{X}_B$ only. This learnt data transformation $E(\cdot)$, the
encoding, is then applied to the labeled data set $\mathcal{X}_S$. Well-known
encoders include principal component analysis (PCA)
or Kernel PCA (KPCA). PCA is a dimensionality reduction
method that finds the optimal linear transformation from 
an N-dimensional to a K-dimensional space (given a mean- 
squared error loss). Kernel PCA [24] extends PCA allowing 
non-linear transformations into a K-dimensional space and 
has, among others, been successfully used for novelty detec- 
tion in non-linear domains [11]. Recently, variational auto- 
encoders (VAE) have outperformed other semi-supervised 
classification techniques on several data sets [15]. VAE com- 
bine variational inference networks with generative models 
parametrized by deep neural networks that exploit informa- 
tion in the data density to find efficient lower dimensional 
representations (feature embeddings) of the data. 
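As a minimal sketch of the idea of learning an encoding on unlabeled data and then applying it to labeled data, the following example fits a one-component PCA on a synthetic unlabeled set and encodes two labeled points with it. The function names `fit_pca_1d` and `encode`, the synthetic data, and the use of power iteration are illustrative assumptions, not part of the pipeline described in this paper.

```python
import random


def fit_pca_1d(rows, iters=100):
    """Fit a one-component PCA; returns (mean, unit-length direction)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # start vector for power iteration
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mean, v


def encode(row, mean, v):
    """Project a centered observation onto the learnt direction."""
    return sum((row[j] - mean[j]) * v[j] for j in range(len(row)))


# Unlabeled data: points scattered around the line y = 2x.
rng = random.Random(0)
unlabeled = []
for _ in range(500):
    t = rng.uniform(-1.0, 1.0)
    unlabeled.append([t, 2.0 * t + rng.gauss(0.0, 0.1)])

mean, direction = fit_pca_1d(unlabeled)

# The encoder learnt without labels is reused on the small labeled set.
labeled = [([1.0, 2.0], 1), ([-1.0, -2.0], 0)]
embedded = [(encode(x, mean, direction), y) for x, y in labeled]
```

The labels are never used while fitting the encoder; they only enter once a classifier is trained on `embedded`.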


Auto-encoder. An auto-encoder or autoassociator [2] is a 
neural network that encodes a given input into a (typically 
lower dimensional) representation such that the original in- 
put can be reconstructed approximately. The auto-encoder 
consists of two parts. The encoder part of the network takes 
the N-dimensional input $x \in \mathbb{R}^N$ and computes an encoding
$z = E(x)$ while the decoder $D$ reconstructs the input
based on the latent representation, $\hat{x} = D(z)$. If we train
a network using the mean squared error loss and the network
consists of a single linear hidden layer of size $K$, e.g.
$E(x) = W_1 x + b_1$ and $D(z) = W_2 z + b_2$ for weights
$W_1 \in \mathbb{R}^{K \times N}$ and $W_2 \in \mathbb{R}^{N \times K}$ and offsets $b_1 \in \mathbb{R}^K$ and
$b_2 \in \mathbb{R}^N$, the auto-encoder behaves similarly to PCA in that
the network learns to project the input into the span of
the first $K$ principal components [2]. For more complex networks
with non-linear layers, multi-modal aspects of the data
can be learnt. Auto-encoders can be used in semi-supervised
classification tasks because the encoder can compute a fea- 
ture representation z of the original data x. These features 
can then be used to train a classifier. The learnt feature 
embedding facilitates classification by clustering related ob- 
servations in the computed latent space. 


Variational auto-encoder. Variational auto-encoders [15] 
are generative models that combine Bayesian inference with 
deep neural networks. They model the input data x as 


$$p_\theta(x \mid z) = f(x; z, \theta), \qquad p(z) = \mathcal{N}(z \mid 0, I) \qquad (1)$$


where $f$ is a likelihood function that performs a non-linear
transformation of $z$ with parameters $\theta$ by employing a deep
neural network. In this model the exact computation of
the posterior $p_\theta(z \mid x)$ is not computationally tractable. Instead,
the true posterior is approximated by a distribution
$q_\phi(z \mid x)$ [16]. This inference network $q_\phi(z \mid x)$ is parametrized
as a multivariate normal distribution as


$$q_\phi(z \mid x) = \mathcal{N}\big(z \mid \mu_\phi(x), \operatorname{diag}(\sigma^2_\phi(x))\big), \qquad (2)$$


where $\mu_\phi(x)$ and $\sigma^2_\phi(x)$ denote vectors of means and variances,
respectively. Both functions $\mu_\phi(\cdot)$ and $\sigma^2_\phi(\cdot)$ are represented
as deep neural networks. Hence, variational auto-encoders
essentially replace the deterministic encoder $E(x)$ and decoder
$D(z)$ by a probabilistic encoder $q_\phi(z \mid x)$ and decoder
$p_\theta(x \mid z)$. Direct maximization of the likelihood is computationally
not tractable; therefore, a lower bound on the likelihood
has been derived [16]. The learning task then amounts
to maximizing this variational lower bound 


$$\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left[q_\phi(z \mid x) \,\|\, p(z)\right], \qquad (3)$$


where KL denotes the Kullback-Leibler divergence. The
lower bound consists of two intuitive terms. The first term 
is the reconstruction quality while the second one regular- 
izes the latent space towards the prior p(z). We perform 
optimization of this lower bound by applying a stochastic 
optimization method using gradient back-propagation [14]. 
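For reference, under the Gaussian assumptions of (1) and (2) the regularization term of the lower bound has a well-known closed form, so only the reconstruction term needs to be estimated by sampling. The symbol $d$ for the number of latent dimensions is introduced here for illustration; $\mu_i$ and $\sigma_i^2$ denote the $i$-th components of the inferred mean and variance vectors.

```latex
\mathrm{KL}\left[ q_\phi(z \mid x) \,\|\, p(z) \right]
  = \frac{1}{2} \sum_{i=1}^{d}
    \left( \sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2 \right)
```

Each summand is zero exactly when $\mu_i = 0$ and $\sigma_i^2 = 1$, i.e. when the corresponding latent dimension matches the prior.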


3. METHOD 


In the following we introduce two networks. First, a simple 
variational auto-encoder consisting of fully connected layers
to learn feature embeddings of student data. These encoders
have been shown to be powerful for semi-supervised classification
[15], and are often applied due to their simplicity.
Second, an advanced auto-encoder that combines the advan- 
tages of VAE with the superiority of asymmetric encoders. 
This is motivated by the fact that asymmetric auto-encoders 
have shown superior performance and more meaningful fea- 
ture representations compared to simple VAE in other do- 
mains such as image synthesis [29]. 


Figure 1: Network layouts for our simple student auto-encoder (S-SAE, left) using only fully connected layers and our
improved CNN student auto-encoder (CNN-SAE, right) using convolutions for the encoder and recurrent LSTM layers
for the decoder. In contrast to standard auto-encoders, the connections to the latent space z are sampled
(red dashed arrows) from a Gaussian distribution.


Student snapshots. There are many applications where
we want to predict a label $y_n$ for each student $n$ within an
ITS based on behavioral data $X_n$. These labels typically
relate to external variables or properties of a student, such
as age, learning disabilities, personality traits, learner types,
learning outcome etc. Similar to Knowledge Tracing (KT)
we propose to model the data $X_n = \{x_{n1}, \dots, x_{nT}\}$ as a
sequence of $T$ observations. In contrast to KT we store $F$
different feature values $x_{nt} \in \mathbb{R}^F$ for each element in the
sequence, where $t$ denotes the $t$-th opportunity within a task.
This allows us to simultaneously store data from multiple
tasks in $x_{nt}$, e.g. $x_{n1}$ stores all features for student $n$ that
were observed during the first task opportunities. For every
task in an ITS we can extract various different features
that characterize how a student $n$ was approaching the task.
These features include performance, answer times, problem
solving strategies, etc. We combine this information into a
student snapshot $X_n \in \mathbb{R}^{T \times F}$, where $T$ is the number of task
opportunities and $F$ is the number of extracted features.
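The snapshot representation can be illustrated with a small sketch that stacks the first T observations of a student into a T x F matrix and vectorizes it. The helper names `make_snapshot` and `vec`, the example values, and the column-stacking order are assumptions for illustration; any fixed ordering works as long as it is used consistently for all students.

```python
def make_snapshot(observations, T, F):
    """Stack the first T observations (each a list of F feature values)
    into a T x F matrix, rejecting students with incomplete data."""
    rows = observations[:T]
    if len(rows) < T or any(len(r) != F for r in rows):
        raise ValueError("student lacks complete data for T opportunities")
    return [list(r) for r in rows]


def vec(matrix):
    """Column-stacking vectorization; returns a list of length T * F."""
    T, F = len(matrix), len(matrix[0])
    return [matrix[t][f] for f in range(F) for t in range(T)]


# Three task opportunities with two features each (made-up values).
obs = [[1.2, 0.4], [0.9, 0.3], [1.5, 0.7]]
X_n = make_snapshot(obs, T=3, F=2)
x = vec(X_n)
```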


Simple student auto-encoder (S-SAE). Our simple variational
auto-encoder follows the general design outlined
in Section 2 and is based on the student snapshot representation.
For ease of notation we use $x := \mathrm{vec}(X_n)$, where
$\mathrm{vec}(\cdot)$ is the matrix vectorization function, to represent the
student snapshot of student $n$. The complete network layout
is depicted in Figure 1, left. The encoder and decoder
networks consist of $L$ fully connected layers that are implemented
as an affine transformation of the input followed by
a non-linear activation function $\beta(\cdot)$ as $x_l = \beta(W_l x_{l-1} + b_l)$,
where $l$ is the layer index and $W_l$ and $b_l$ are a weight matrix
and offset vector of suitable dimensions. Typical choices for
$\beta(\cdot)$ include tanh, rectified linear units or sigmoid functions
[6]. To produce latent samples $z$ we sample from the normal
distribution (see Equation (2)) using re-parametrization [16]


$$z = \mu_\phi(x) + \sigma_\phi(x)\,\epsilon, \qquad (4)$$


where $\epsilon \sim \mathcal{N}(0, 1)$, to allow for back-propagation of gradients.
For $p_\theta(x \mid z)$ (see (1)) any suitable likelihood function
can be used. We used a Gaussian distribution for all
presented examples. Note that the likelihood function is
parametrized by the entire (non-linear) decoder network.
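The re-parametrization step can be sketched as follows: the randomness is drawn from a fixed standard normal, and the sample is a deterministic function of the inferred mean and log-variance. The function name `sample_latent` and the example parameter values are assumptions for illustration; in a real network the mean and log-variance would be outputs of the encoder.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps element-wise, with eps ~ N(0, 1) and
    sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


rng = random.Random(42)
mu = [1.0, -2.0]
log_var = [0.0, math.log(0.25)]  # variances 1 and 0.25

draws = [sample_latent(mu, log_var, rng) for _ in range(20000)]
mean0 = sum(d[0] for d in draws) / len(draws)
mean1 = sum(d[1] for d in draws) / len(draws)
std1 = (sum((d[1] - mean1) ** 2 for d in draws) / len(draws)) ** 0.5
```

The empirical mean and standard deviation of the draws approach the requested parameters, while the sample remains differentiable with respect to `mu` and `log_var` because the noise does not depend on them.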


The training of variational auto-encoders can be challenging 
as stochastic optimization was found to set $q_\phi(z \mid x) = p(z)$
in all but vanishingly rare cases [3], which corresponds to a
local maximum that does not use any information from $x$.
We therefore add a warm-up phase that gradually gives the
regularization term in the target function more weight:


$$\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \alpha\,\mathrm{KL}\left[q_\phi(z \mid x) \,\|\, p(z)\right], \qquad (5)$$


where $\alpha \in [0, 1]$ is linearly increased with the number of
epochs. The warm-up phase has been successfully used
for training deep variational auto-encoders [25]. Furthermore,
we initialize the weights of the dense layer computing
$\log(\sigma^2_\phi(x))$ to 0 (yielding a variance of 1 at the beginning of
the training). This was motivated by our observation that if
we employ standard random weight initialization techniques
(glorot-norm, he-norm [9]) we can get relatively high initial
estimates for the variance $\sigma^2_\phi(x)$, which due to the sampling
leads to very unreliable samples $z$ in the latent space. The
large variance in sampled points in the latent space leads to
bad convergence properties of the network.
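A minimal sketch of the warm-up schedule and the weighted objective is given below. The function names are hypothetical, and the assumption that the weight rises linearly from 0 at epoch 0 to 1 at the end of the warm-up phase is one plausible reading of the text; the default of 300 epochs is the warm-up length reported in Section 4.1.

```python
import math


def warmup_alpha(epoch, warmup_epochs=300):
    """Weight of the KL term: rises linearly from 0 to 1, then stays at 1."""
    return min(1.0, epoch / warmup_epochs)


def warmup_objective(recon_log_lik, kl, epoch, warmup_epochs=300):
    """Objective to maximize: reconstruction term minus weighted KL term."""
    return recon_log_lik - warmup_alpha(epoch, warmup_epochs) * kl


# With the log-variance layer initialized to 0, the initial variance is 1.
initial_variance = math.exp(0.0)
```

Early in training the objective is dominated by reconstruction quality, so the encoder has an incentive to use information from the input before the pull towards the prior reaches full strength.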


CNN student auto-encoder (CNN-SAE). Following 
the recent findings in computer vision we present a second, 
more advanced network that typically outperforms simpler 
VAE. In [29], for example, these asymmetric auto-encoders 
resulted in superior reconstruction of images as well as more 
meaningful feature embeddings. A specific kind of convolu- 
tional neural network was combined with an auto-encoder, 
being able to directly capture low level pixel statistics and 
hence to extract more high-level feature embeddings. 


Figure 2: Pipeline overview. We train the variational auto-encoder on a large unlabeled data set. The trained
encoder of the auto-encoder can be used to transform other data sets into an expressive feature embedding.
Based on this feature embedding we train different classifiers to predict the student labels.


Inspired by this previous work, we combine an asymmetric
auto-encoder (and a decoder that is able to capture low level
statistics) with the advantages of variational auto-encoders.
Figure 1, right, shows our combined network. We employ
multiple layers of one-dimensional convolutions to parametrize
the encoder $q_\phi(z \mid x)$ (again we assume a Gaussian distribution,
see (2)). The distribution is parametrized as follows:


$$\mu_\phi(x) = W_\mu h + b_\mu$$
$$\log(\sigma^2_\phi(x)) = W_\sigma h + b_\sigma$$
$$h = \mathrm{conv}_l(x) = \beta(W_l * \mathrm{conv}_{l-1}(x)),$$


where $*$ is the convolution operator, $W_l$, $W_\mu$, $W_\sigma$ are weights
of suitable dimensions, $\beta(\cdot)$ is a non-linear activation function
and $l$ denotes the layer depth. Further, $\mathrm{conv}_0(x) = x$.
We keep the standard variational layer (see (4)) while changing
the output layer to a recurrent layer using long short-term
memory (LSTM) units. Recurrent layers have successfully
been used in auto-encoders before, e.g. in [5]. LSTMs
have been very successful for modeling temporal sequences because
they can model long and short term dependencies between
time steps. Every LSTM unit receives a copy of the
sampled points in latent space, which allows the LSTM network
to combine context information (point in the latent
space) with the sequence information (memory unit in the
LSTM cell). Using LSTM cells the decoder $p_\theta(x \mid z)$ assumes
a Gaussian distribution and is parametrized as follows:


$$\mu_{\theta,t}(z) = W_{\mu z} \cdot \mathrm{lstm}_t(z) + b_{\mu z}$$
$$\log(\sigma^2_{\theta,t}(z)) = W_{\sigma z} \cdot \mathrm{lstm}_t(z) + b_{\sigma z},$$


where $\mu_{\theta,t}(z)$ and $\sigma^2_{\theta,t}(z)$ are the $t$-th components of $\mu_\theta(z)$ and
$\sigma^2_\theta(z)$, respectively, $\mathrm{lstm}_t(\cdot)$ denotes the $t$-th LSTM cell and
$W_{\cdot z}$ and $b_{\cdot z}$ denote suitable weight and offset parameters.
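The recursive definition of the convolutional encoder features can be illustrated with a single-channel sketch. This is a simplification under stated assumptions: the network described here uses many kernels per layer, whereas the example uses one kernel per layer, no padding, ReLU as the activation, and the sliding dot product (cross-correlation) that deep-learning libraries commonly call convolution. The names `conv1d`, `relu` and `conv_stack` are hypothetical.

```python
def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, a) for a in values]


def conv1d(signal, kernel):
    """Sliding dot product without padding; the output is shorter than
    the input by len(kernel) - 1."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def conv_stack(x, kernels):
    """Apply conv_l(x) = activation(W_l * conv_{l-1}(x)), with conv_0(x) = x."""
    h = x
    for w in kernels:
        h = relu(conv1d(h, w))
    return h


x = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
kernels = [[-1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]  # two layers, filter length 3
h = conv_stack(x, kernels)
```

The resulting vector `h` is what the two affine maps for the mean and the log-variance would take as input.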


Feature selection. VAE provide a natural way for performing
feature selection. The inference network $q_\phi(z \mid x)$
infers the mean and variance for every dimension $z_i$. Therefore,
the most informative dimension $z_i$ has the highest KL
divergence from the prior distribution $p(z_i) = \mathcal{N}(0, 1)$ while
uninformative dimensions will have a KL divergence close to
0 [10]. The KL divergence of $z_i$ to $p(z_i)$ is given by


$$\mathrm{KL}\left[q_\phi(z_i \mid x) \,\|\, p(z_i)\right] = -\log(\sigma_i) + \frac{\sigma_i^2 + \mu_i^2}{2} - \frac{1}{2},$$


where $\mu_i$ and $\sigma_i$ are the inferred parameters for the Gaussian
distribution $q_\phi(z_i \mid x)$. Feature selection proceeds by keeping
the $K$ dimensions $z_i$ with the largest KL divergence.
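The selection rule can be sketched as follows. The function names and the example parameter values are assumptions for illustration. The text does not specify how the per-sample divergences are aggregated over a data set (for instance by averaging), so the sketch takes one mean and one standard deviation per latent dimension as given.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL divergence of N(mu, sigma^2) from the prior N(0, 1)."""
    return -math.log(sigma) + (sigma ** 2 + mu ** 2) / 2.0 - 0.5


def select_dimensions(mus, sigmas, K):
    """Return the indices of the K latent dimensions with the largest
    KL divergence, together with all divergences."""
    kls = [kl_to_standard_normal(m, s) for m, s in zip(mus, sigmas)]
    ranked = sorted(range(len(kls)), key=lambda i: kls[i], reverse=True)
    return sorted(ranked[:K]), kls


# Four latent dimensions; the first one equals the prior exactly.
mus = [0.0, 1.5, 0.1, -2.0]
sigmas = [1.0, 0.5, 0.9, 0.3]
selected, kls = select_dimensions(mus, sigmas, K=2)
```

Dimensions whose inferred distribution stays close to the prior carry little information about the input and are dropped first.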


Semi-supervised classification pipeline. The encoder 
and the decoder of the variational auto-encoder can be used 
independently of each other. This independence allows us 
to take the trained encoder and map new data to the learnt 
feature embedding. Figure 2 provides an overview of the 
entire pipeline for semi-supervised classification. In a first 
unsupervised step we train a VAE on unlabeled data. The 
learnt encoder $q_\phi(z \mid x)$ is then used to transform labeled data
sets to the feature embedding. We finally apply our feature 
selection step that considers the relative importance of the 
latent dimensions as previously described. We then train 
standard classifiers (Logistic Regression, Naive Bayes and 
Support Vector Machine) on the feature embeddings. 


4. RESULTS 


We evaluated our approach for the specific example of de- 
tecting developmental dyscalculia (DD), which is a learning 
disability affecting the acquisition of arithmetic skills [33]. 
Based on the learnt feature embedding on a large unlabeled 
data set the classifier performance was measured on two in- 
dependent, small and labeled data sets from controlled user 
studies. We refer to them as balanced and imbalanced data
sets since their distribution of DD and non-DD children differs:
the first study has approximately 50% DD, while the
second one includes 5% DD (typical prevalence of DD). 


4.1 Experimental Setup 

All three data sets were collected from Calcularis, which is 
an intelligent tutoring system (ITS) targeted at elementary 
school children suffering from DD or exhibiting difficulties 
in learning mathematics [13]. Calcularis consists of different 
games for training number representations and calculation. 
Previous work identified a set of games that are predictive 
of DD within Calcularis [17]. Since timing features were 
found to be one of the most relevant indicators for detecting 
DD [4] and to facilitate comparison to other feature embed- 
ding techniques we limited our analysis to log-normalized 
timing features, for which we can assume normal distribu- 
tion [30]. Therefore, we evaluated our pipeline on the sub- 
set of games from [17] for which meaningful timing features 
could be extracted and sufficient samples were available in all 
data sets (we used >7000 samples for training the VAEs). 
Since our pipeline currently does not handle missing data,
only students with complete data were included.


Timing features were extracted for the first 5 tasks in 5 dif- 
ferent games. The selected games involve addition tasks 
(adding a 2-digit number to a 1-digit number with ten- 
crossing; adding two 2-digit numbers with ten-crossing), num- 
ber conversion (spoken to written numbers in the ranges 0- 
10 and 0-100) and subtraction tasks (subtracting a 1-digit 
number from a 2-digit number with ten-crossing). For every 
task we extracted the total answer time (time between the 
task prompt until the answer was entered) and the response 
time (time between the task prompt and the first input by 
the student). Hence, each student is represented by a 50- 
dimensional snapshot x (see Section 3). 
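The dimensionality follows from 5 games x 5 tasks x 2 timing features = 50. The sketch below builds such a vector under two assumptions that are not stated in the text: that log-normalization amounts to taking the natural logarithm of each time in seconds, and that features are ordered game by game and task by task. The function name `timing_snapshot` and the example times are hypothetical.

```python
import math

GAMES, TASKS = 5, 5  # from the text: first 5 tasks in 5 different games


def timing_snapshot(times):
    """times[g][t] = (answer_time_s, response_time_s).
    Returns a flat list of 50 log-transformed timing features."""
    if len(times) != GAMES or any(len(game) != TASKS for game in times):
        raise ValueError("incomplete data: such a student would be excluded")
    features = []
    for game in times:
        for answer_s, response_s in game:
            features.extend([math.log(answer_s), math.log(response_s)])
    return features


# A made-up student who needs 8 s to answer and 2 s to start typing.
example = [[(8.0, 2.0)] * TASKS for _ in range(GAMES)]
x = timing_snapshot(example)
```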


Unlabeled data set. The unlabeled data set was extracted 
using live interaction logs from the ITS Calcularis. In total, 
we collected data from 7229 children. Note that we have 
no additional information about the children such as DD or 
grade. We excluded all teacher accounts as well as log files 
that were < 20KB. Since every new game in Calcularis is 
introduced by a short video during the very first task, we 
excluded this particular task for all games. 


Balanced data set. The first labeled data set is based
on log files from 83 participants of a multi-center user study
conducted in Germany and Switzerland, where approximately
half of the participants were diagnosed with DD (47 DD, 36
control) [31]. During the study, children trained with Calcularis
at home five times per week for six weeks and
solved on average 1551 tasks. There were 28 participants
in 2nd grade (9 DD, 19 control), 40 children in 3rd grade
(23 DD, 17 control), 12 children in 4th grade (12 DD) and
3 children in 5th grade (3 DD). The diagnosis of DD was
based on standardized neuropsychological tests [31].


Imbalanced data set. The second labeled data set is from 
a user study conducted in the classroom of ten Swiss elemen- 
tary school classes. In total, 155 children participated, and 
a prevalence of DD of 5% could be detected (8 DD, 147 control).
There were 97 children in 2nd grade (3 DD, 94 control)
and 58 children in 3rd grade (5 DD, 53 control). The DD diagnosis
was computed based on standardized tests assessing
the mathematical abilities of the children [32, 7]. During the 
study, children solved 85 tasks directly in the classroom. On 
average, children needed 26 minutes to complete the tasks. 


Implementation. The unlabeled data set was used to train 
the unsupervised VAE for extracting compact feature em- 
beddings of the data. Based on the learnt data transforma- 
tions we evaluated two standard classifiers: Logistic Regres- 
sion (LR) and Naive Bayes (NB). We restricted our evalu- 
ation to simple classification models because we wanted to 
assess the quality of the feature embedding and not the qual- 
ity of the classifier. More advanced classifiers typically per- 
form a (sometimes implicit) feature transformation as part 
of their data fitting procedure. To represent at least one 
model that performs such an embedding we included Sup- 
port Vector Machine (SVM) in all our results. All classifier 
parameters were chosen according to the default values in 
scikit-learn. Note that we have additionally performed ran- 
domized cross-validated hyper-parameter search for all clas- 
sifiers, which, however, resulted in marginal improvements 
only. Because of that, and to keep the model simple and es- 
pecially easily reproducible, we use the default parameter set 
in this work. For Logistic Regression we used L2 regularization
with $C = 1$, for Naive Bayes we used Gaussian distributions,
and for the SVM we used RBF kernels with data point weights
set inversely proportional to label frequencies. All
results are cross-validated using 30 randomized training-test 
splits on the unlabeled data (test size 5%). The classification 
part of the pipeline is additionally cross-validated using 300 
label-stratified random training-test splits (test size 20%) to 
ensure highly reproducible classification results. 
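To make the weighting and the label-stratified splitting concrete, the following sketch computes inverse-frequency class weights for the imbalanced study (8 DD, 147 control) and draws one stratified split with a test size of 20%. The weight formula mirrors the "balanced" heuristic of scikit-learn; whether exactly this formula and this rounding rule were used is an assumption, and the function names are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def stratified_split(labels, test_size, rng):
    """Return (train_idx, test_idx), drawing test_size of each class."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = max(1, round(test_size * len(idx)))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


labels = [1] * 8 + [0] * 147  # 1 = DD, 0 = control
weights = inverse_frequency_weights(labels)
train_idx, test_idx = stratified_split(labels, 0.2, random.Random(0))
```

With only 8 positive cases, stratification guarantees that every test split contains DD children, and repeating the split many times averages out the dependence on which few of them are held out.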


Network hyper-parameters were defined using the approach 
described in [1]. We increased the number of nodes per 
layer, the number of layers and the number of epochs until 
a good fit of the data was achieved. We then regularized 
the network using dropout [26] with increasing dropout rate 
until the network was no longer overfitting the data. Ac- 
tivation and weight initialization have been chosen accord- 
ing to common standards: We employ the most common 
activation function, namely rectified linear activation units 
(ReLU) [20], for all activations. Weight initialization was
performed using the method by He et al. [9]. Following this 
procedure, the following parameters were used for the S-SAE
model: the encoder and decoder each used 3 layers of size 320.
The CNN-SAE model was parametrized as follows: 3 convolution
layers with 64 convolution kernels and a filter length
of 3. We used a single layer of LSTM cells with 80 nodes.
We used a batch size of 500 samples and batch normalization
and dropout (r = 0.25) at every layer. The warm-up
phase (see Section 3) was set to 300 epochs. Training was
stopped after 1000 (S-SAE) and 500 (CNN-SAE) epochs.
The number of latent units was set to 8 in accordance with
previous work on detecting students with DD that used 17
features but found that about half of the features were sufficient
to detect DD with high accuracy [17]. When feature
selection was applied we set the number of features to $K = 4$
and thus we kept exactly half of the latent space features.
All networks were implemented using the Keras framework
with TensorFlow and optimized using Adam stochastic
optimization with standard parameters according to [14].


4.2 Performance comparison 

Our VAE models are trained to extract efficient feature em- 
beddings of the data. To assess the quality of these com- 
puted feature representations, we compare the classification 
performance of our method to previous techniques for find- 
ing efficient feature embeddings, as well as to feature sets 
optimized specifically for the task of predicting DD. 


Network comparison. In a first experiment we compared 
the feature embeddings generated by our simple S-SAE and 
our asymmetric CNN-SAE with and without feature selec- 
tion. Figure 3 illustrates the average ROC curves of our 
complete semi-supervised classification pipeline. Our fea- 
ture embeddings based on asymmetric CNN-SAE clearly 
outperform the ones from the simple S-SAE on both the 
imbalanced and the balanced data set for Naive Bayes (NB) 
and Logistic Regression (LR). For both models, feature se- 
lection improves the area under the ROC curve (AUC) for 
the imbalanced data set (CNN-SAE: LR 4.2%, NB 6.3%;
S-SAE: LR 6.8%, NB 1.6%), but has no effect for the balanced
data set. We believe that this is due to the ability of
the classifiers to distinguish useful features from noisy ones
given enough samples. Since the performance of the classifiers
with feature selection (FS) is better than or equal to that
without feature selection in each experiment, we used the CNN-SAE
FS model for all further evaluations.


Classification performance. In Figure 4 we compare the 
classifier performance for different feature embeddings. We 
compare our method based on VAE to two well-known methods
for finding optimal feature embeddings, namely principal
component analysis (PCA, green) and Kernel PCA (KPCA,
red) [24]. For comparison and as a baseline for the perfor- 
mance of the different methods, we include direct classifi- 
cation results (gray), for which no feature embedding was 
computed. We used K = 8 (dimensionality of feature em- 
bedding) for all methods. The features extracted by our 
pipeline compare favorably to PCA and Kernel PCA, showing
improvements in terms of AUC of 28% for Logistic Regression
and 23% for Naive Bayes on the imbalanced data
set and an improvement of 3.75% for Logistic Regression
and 7.5% for Naive Bayes on the balanced data set. By
using simple classifiers, we demonstrated that our encoder 
learns an effective feature embedding. More sophisticated 
classifiers (such as SVM with non-linear kernels) typically 
proceed by first embedding the input into a specific feature 
space that is different from the original space. 
Figure 3: ROC curves for the two proposed models
with and without feature selection (FS). Our
asymmetric CNN-SAE outperforms the simple S- 
SAE consistently with (blue) and without (purple) 
feature selection. Feature selection improves perfor- 
mance only on the imbalanced data set. 


For the imbalanced data set the overall performance for 
SVM is significantly lower for all embeddings. This is in line 
with previous work [12] showing that for imbalanced data 
sets, the decision boundaries of SVMs are heavily skewed 
towards the minority class resulting in a preference for the 
majority class and thus a high misclassification rate for the
minority class. Indeed, we found that SVM predicted only 
majority labels on the imbalanced data set. For the bal- 
anced data set our feature embedding shows improvements 
of 2.5% over alternative embeddings when using SVM. 


Further, Table 1 shows the performance of all feature embed- 
dings using three additional common classification metrics: 
root mean squared error (RMSE), classification accuracy 
(Acc.) and area under the precision recall curve (AUPR). 
We statistically compared the classification metrics of our 
feature embedding to the best alternative feature embed- 
ding using an independent t-test and Bonferroni correction 
for multiple tests (α = 0.05). Our feature embedding significantly
outperformed alternative embeddings for all classifiers
on both the balanced and imbalanced data sets on most
metrics. The main exception was the performance of SVM 
on the imbalanced data set, which exhibited large variance 
for all feature embeddings and the worst overall classifica- 
tion performance (compared to the other classifiers). 
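The Bonferroni correction mentioned above amounts to dividing the significance level by the number of tests. The sketch below illustrates only that step. The p-values are hypothetical, and how many tests the correction covered is not stated here, so the count of four is an assumption.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant at family-wise level alpha."""
    m = len(p_values)
    threshold = alpha / m  # per-test threshold after correction
    return [p < threshold for p in p_values]


# Hypothetical p-values, e.g. one per metric (AUC, RMSE, AUPR, Acc.) for one classifier.
p_values = [0.001, 0.012, 0.030, 0.200]
flags = bonferroni_significant(p_values)
```

In this toy case the per-test threshold is 0.0125, so a p-value of 0.030 that would pass an uncorrected test at 0.05 is no longer counted as significant.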


When comparing classification performance on the imbalanced
and the balanced data sets, we observed that our pipeline using
VAEs showed significant performance improvements compared
to other methods for finding feature embeddings. While the
unlabeled and the balanced data sets stem from an adaptive
version of Calcularis, the imbalanced data was collected using
a fixed task sequence. As our method shows larger improvements
on the imbalanced data, we believe CNN-SAE learned an
embedding that is robust beyond adaptive ITS. The relative
improvement of our feature embedding is smallest for SVM on
the balanced data set. We believe that this is due to the ability
of the SVM to learn complex decision boundaries given sufficient
data. However, the capacity for complex decision boundaries
renders SVMs more vulnerable to class imbalance, yielding
performance at random level on the imbalanced data set.


Comparison to specialized models. Recently, a spe- 
cialized Naive Bayes classifier (S-NB) for the detection of 
developmental dyscalculia (DD) was introduced, along with
a set of features optimized for the detection of DD [17].
The development of S-NB including the set of features was 
based on the balanced data set used in this work. In com- 
parison to S-NB, our approach relies on timing data only 
and the extracted features are independent of the classifi- 
cation task. We compared the performance of S-NB to our 
CNN-SAE model on both data sets. For the balanced data 
set we found an AUC of 0.94 for the specialized model (S- 
NB) compared to an AUC of 0.86 for Naive Bayes using our 
feature embedding. On the imbalanced data set we found 
an AUC of 0.67 for S-NB compared to an AUC of 0.77 us- 
ing Logistic Regression with our feature embedding. These 
findings demonstrate that while our feature embedding per- 
forms slightly worse on the balanced data set (for which the 
S-NB was developed), we significantly outperform S-NB by 
15% on the imbalanced data set, which suggests that our 
VAE model automatically extracts feature embeddings that 
are more robust than expert features. 
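The 15% figure is consistent with a relative (rather than absolute) AUC difference, as the short check below suggests. The function name is illustrative; the AUC values are those quoted in the paragraph above.

```python
def relative_improvement(ours, baseline):
    """Relative change of `ours` with respect to `baseline`."""
    return (ours - baseline) / baseline


# AUC values quoted in the text: our embedding vs. the specialized S-NB model.
imbalanced = relative_improvement(0.77, 0.67)  # Logistic Regression vs. S-NB
balanced = relative_improvement(0.86, 0.94)    # Naive Bayes vs. S-NB

print(f"imbalanced: {imbalanced:+.1%}, balanced: {balanced:+.1%}")
```

Under this reading, the gain on the imbalanced data set is larger in magnitude than the loss on the balanced data set for which S-NB was developed.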


Robustness to sample size. Ideally, a classifier’s performance
should degrade gracefully as less data is provided.
A good feature embedding allows a classifier to generalize 
well based on few labeled examples because similar samples 
are clustered together in the feature embedding. We there- 
fore investigated the robustness of the different feature rep- 
resentations with respect to the training set size. For this we 
used the balanced data set where we varied the training set 
size between 7 (10% of the data) and 62 (90% of the data) 
by random label-stratified sub-sampling. Figure 5 compares 
the AUC of the different feature embeddings over different 
sizes of the training set. In case of Naive Bayes and Logis- 
tic Regression our embedding provides superior performance 
for all training set sizes. For large enough data sets, SVM
using the raw feature data (Direct, gray) performs as
well as SVM using our embedding (CNN-SAE, blue). However,
for smaller data sets starting at 30 samples the performance 
of SVM based on the raw features declines more rapidly 
compared to the SVM based on our feature embedding. 
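Label-stratified sub-sampling of the training set, as used for this experiment, can be sketched as follows. This is an illustrative version under stated assumptions: the function name, the rounding rule, and the 40/30 label vector are hypothetical and not taken from the study.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Return sorted indices of a random subsample that preserves label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)
    chosen = []
    for idxs in by_label.values():
        # Assumed rule: round to the nearest count, but keep at least one sample per class.
        k = max(1, round(fraction * len(idxs)))
        chosen.extend(rng.sample(idxs, k))
    return sorted(chosen)


# Hypothetical label vector with 40 negative and 30 positive samples.
labels = [0] * 40 + [1] * 30
subset = stratified_subsample(labels, 0.10)
```

Repeating the draw with different seeds for each training set size would give the spread of AUC values over which a curve such as the one in Figure 5 can be averaged.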


5. CONCLUSION 


We adapted the recently developed variational auto-encoders 
to educational data for the task of semi-supervised clas- 
sification of student characteristics. We presented a com- 
plete pipeline for semi-supervised classification that can be 
used with any standard classifier. We demonstrated that structures
extracted from large-scale unlabeled data sets can
significantly improve classification performance for different 
labeled data sets. Our findings show that the improvements 
are especially pronounced for small or imbalanced data sets. 
Imbalanced data sets typically arise in EDM when detecting 
relatively rare conditions such as learning disabilities. Im- 
Figure 4: Classification performance for different feature embeddings. Our variational auto-encoder (blue)
outperforms other embeddings by up to 28% (imbalanced data set) and by up to 7.5% (balanced data set).

Figure 5: Comparison of classifier performance on the balanced data for different training set sizes (moving
average fitted to data points). The features automatically extracted by our variational auto-encoder (blue)
maintain a performance advantage even if the training size shrinks to 7 samples (10% of the original size).


Table 1: Comparison of our method to alternative embeddings. Our approach using a variational auto-encoder 
(CNN-SAE) significantly outperforms other approaches for most cases. The best score for each metric and 
classifier is shown in bold. * = statistically significant difference (t-test with Bonferroni correction, α = 0.05).


                     Direct                      PCA                         Kernel PCA                  CNN-SAE
                     AUC    RMSE   AUPR   Acc.   AUC    RMSE   AUPR   Acc.   AUC    RMSE   AUPR   Acc.   AUC    RMSE   AUPR   Acc.

Imbalanced data set
Logistic Regression  0.53   0.27   0.18   0.91   0.54   0.25   OF     0.93   0.61   0.25   0.16   0.93   0.78*  0.24*  0.28*  0.94*
Naive Bayes          0.51   0.29   0.23   0.91   0.50   0.29   0.10   0.90   0.57   0.28   0.20   0.91   0.70*  0.25*  0.24   0.93*
SVM                  0.55   0.25   0.22*  0.94   0.40   0.25   0.08   0.94   0.42   0.25   0.09   0.93   0.59   0.25   0.16   0.94

Balanced data set
Logistic Regression  0.80   0.44   0.82   0.73   0.80   0.42   0.84   0.73   0.80   0.42   0.83   0.75   0.83*  0.40*  0.84   0.77
Naive Bayes          0.80   0.49   0.80   0.73   0.77   0.46   0.77   0.71   0.76   0.46   0.76   0.70   0.86*  0.38*  0.86*  0.80*
SVM                  0.81   0.42   0.84*  0.75   0.79   0.43   0.81   0.73   0.80   0.43   0.83   0.73   0.83   0.40*  0.81   0.79*


proved classification results with simple classifiers such as 
Logistic Regression might indicate that VAEs learn feature 
embeddings that are interpretable by human experts. In 
the future we want to explore the learnt representations and
compare them to traditional categorizations of students (skills,
performance, etc.). Additionally, we want to extend our
results to include additional feature types and data reliability
indicators to handle missing data. Although we trained
our networks on comparatively small sample sizes, the presented
method scales (due to mini-batch learning) to much
larger data sets (>100K users), allowing the training of more
complex VAEs. Moreover, the generative model p_θ(x|z) that
is part of any VAE can be used to produce realistic data
samples [29]. Up-sampling of the minority class provides a
potential way to improve the decision boundaries for classifiers.
In contrast to common up-sampling methods such as
ADASYN [8], VAE-based sampling does not require nearest
neighbor computations, which makes it better suited
to small data sets. Preliminary results for random subsets
of the balanced data set showed AUC improvements of 2-3%
for VAE-based up-sampling compared to ADASYN.
While we applied our method to the specific case of detecting 
developmental dyscalculia, the presented pipeline is generic 
and thus can be applied to any educational data set and 
used for the detection of any student characteristic. 
ABSTRACT


We show how the novel use of a semantic representation 
based on Osgood’s semantic differential scales can lead to 
effective features in predicting short- and long-term learning 
in students using a vocabulary learning system. Previous 
studies of students’ intermediate knowledge states during
vocabulary acquisition did not provide much information
on which semantic knowledge students gained during word
learning practice. Moreover, these studies relied on human
ratings to evaluate the students’ responses. To solve this
problem, we propose a semantic representation for words
based on Osgood’s semantic decomposition of vocabulary
[16]. To demonstrate that our method can effectively represent
students’ knowledge in vocabulary acquisition, we build 
models for predicting the student’s short-term vocabulary 
acquisition and long-term retention. We compare the 
effectiveness of our Osgood-based semantic representation to 
that provided by Word2Vec neural word embedding [13], and 
find that prediction models using features based on Osgood 
scale-based scores (OSG) perform better than the baseline 
and are comparable in accuracy to those using Word2Vec 
score-based models (W2V). By using more interpretable 
Osgood-based scales, our study results can help with better 
understanding of students’ ongoing learning states and 
designing personalized learning systems that can address an 
individual’s weak points in vocabulary acquisition. 


Keywords 
Vocabulary learning, semantic similarity, prediction model, 
intelligent tutoring system 


1. INTRODUCTION 


Studies of word learning have shown that knowledge of 
individual words is typically not all-or-nothing. Rather, 
people acquire varying degrees of knowledge of many words 
incrementally over time, by exposure to them in context [9]. 
This is especially true for so-called “academic” words that are 
less common and more abstract — e.g., pontificate, probity, 
or assiduous [7]. Binary representations and measures model
word knowledge simply as correct or incorrect on a particular
item (word), but in reality, a student’s knowledge level may
reside between these two extremes. Thus, previous studies of
vocabulary acquisition have suggested that students’ partial
knowledge be modeled using a representation that adds an
additional label corresponding to an intermediate knowledge
state [6] or, further, in terms of continuous metrics for
semantic similarity [3].


In addition, there are multiple dimensions to a word’s 
meaning [16]. Measuring a student’s partial knowledge on 
a single scale may only provide abstract information about 
the student’s general answer quality and not give enough 
information to specify which dimensions of word knowledge 
a student already has learned or needs to improve. In order 
to achieve detailed understanding of a student’s learning 
state, online learning systems should be able to capture 
a student’s “learning trajectory” that tracks their partial 
knowledge on a particular item over time, over multiple 
dimensions of meaning in a multidimensional semantic 
representation. 


Hence, multidimensional representations of word knowledge 
can be an important element for building an effective 
intelligent tutoring system (ITS) for reading and language. 
Maintaining a fine-grained semantic representation of a 
student’s degree of word knowledge can be helpful for 
the ITS to design more engaging instructional content, 
more helpful personalized feedback, and more sensitive 
assessments [17, 19]. Selecting semantic representations 
to model, understand, and predict learning outcomes is 
important to designing a more effective and efficient ITS. 


In this paper, we explore the use of multidimensional 
semantic word representations for modeling and predicting 
short- and long-term learning outcomes in a vocabulary 
tutoring system. Our approach derives predictive 
features using a novel application of existing methods in 
cognitive psychology combined with methods from natural 
language processing (NLP). First, we introduce a new 
multidimensional representation of a word based on the 
Osgood semantic differential [16], an empirically based, 
cognitive framework that uses a small number of scales 
to represent latent components of word meaning. We 
compare the effectiveness of model features based on this 
Osgood-based representation to features based on a different 
representation, the widely-used Word2Vec word embedding 
[13]. Second, we evaluate our prediction models using 
data from a meaning-generation task that was conducted 
during a computer-based intervention. Our study results 
demonstrate how similarity-based metrics based on a rich
semantic representation can be used to automatically
evaluate specific components of word knowledge, track 
changes in the student’s knowledge toward the correct 
meaning, and compute a rich set of features for use in 
predicting short- and long-term learning outcomes. Our 
methods could support advances in real-time, adaptive 
support for word semantic learning, resulting in more 
effective personalized learning systems. 


2. RELATED WORK 

The present study is informed by three areas of research: 
(1) studies of partial word knowledge; (2) the Osgood 
framework for multiple dimensions of word meaning; and (3)
computational methods for estimating semantic similarity. 


Partial Word Knowledge. The concept of partial word 
knowledge has interested vocabulary researchers for several 
decades, particularly in the learning and instruction of “Tier 
2” words [20]. Tier 2 words are low-frequency and typically 
have complex (multiple, nuanced) meanings. By nature, 
they are rarely learned through “one-shot” learning or direct 
definition. Instead, they are learned partially and gaps are 
filled in over time. 


Words in this intermediate state, neither novel nor fully 
known, are sometimes called “frontier words” [5]. Durso 
and Shore operationalized the frontier word as a word the 
student had seen previously but was not actively using [6].
Based on this definition, the student may have had implicit
memory of frontier words, such as general information like
whether the word indicates a good or bad situation or refers
to a person or an action. They discovered that students are
more familiar with frontier words than other types of words 
in terms of their sounds and orthographic characteristics [6]. 
This previous work suggested that the concept of frontier 
words can be used to represent a student’s partial knowledge 
states in a vocabulary acquisition task [5, 6]. 


In some studies, partial word knowledge has been 
represented using simple, categorical labels, e.g., multiple- 
choice tests that include “partially correct” response options, 
as well as a single “best” (correct) response. In other studies, 
the student is presented with a word and is asked to say 
what it means [1]. The definition is given partial credit 
if it reflects knowledge that is partial or incomplete. For 
example, a student may recognize that the word probity 
has a positive connotation, even if she cannot give a 
complete definition. However, single categorical or score- 
based indicators may not explain which specific aspects of 
vocabulary knowledge the student is missing. Moreover, 
these studies relied on human ratings to evaluate students’ 
responses for unknown words [6]. Although widely used 
in psychometric and psycholinguistic studies [4, 16], hiring 
human raters is expensive, and rating may not be feasible in real time
during students’ interaction with the tutoring system.


To address these problems, we propose a data-driven method 
that can automatically extract semantic characteristics of 
a word based on a set of relatively simple, interpretable 
scales. The method benefits from existing findings in 
cognitive psychology and natural language processing. In 
the following sections, we describe these related findings in
more detail and show how they can be used in an intelligent
tutoring system setting.


Semantic Representation & the Osgood Framework.
To quantify the semantic characteristics of a student’s 
intermediate knowledge of vocabulary, this paper uses a 
“spatial analogue” for capturing semantic characteristics of 
words. In [16], Osgood investigated how the meaning of 
a word can be represented by a series of general semantic 
scales. By using these scales, Osgood suggested that the 
meaning of any word can be projected and explored in a
continuous semantic space. 


Osgood asked human raters to evaluate a set of words using a 
large number of scales (e.g., tall-short, fat-thin, heavy-light) 
and captured the semantic representation of a word [16]. 
Respondents gave Likert ratings, which indicated whether 
they thought that a word meaning was closer to one extreme 
(-3) or the other (+3), or basically irrelevant (0). A principal 
components analysis (PCA) was used to represent the latent 
semantic features that can explain the patterns of response 
to individual words within this task. 


In our study, we suggest a method that can automatically 
extract similar semantic information that can project a word 
into a multidimensional semantic space. By using semantic 
scales selected from [16], we test whether such a representation of
semantic attributes of words is useful for predicting students’
short- and long-term learning. 


Semantic Similarity Measures. Studies in NLP have 
suggested methods to automatically evaluate the semantic 
association between two words. For example, Markov 
Estimation of Semantic Association (MESA) [3, 9] can 
estimate the similarity between words from a random walk 
model over a synonym network such as WordNet [14]. Other 
methods like latent semantic analysis (LSA) are based on 
co-occurrence of the word in a document corpus. In LSA, 
semantic similarity between words is determined by using 
a cosine similarity measure, derived from a sparse matrix 
constructed from unique words and paragraphs containing 
the words [10]. 


For this paper, we use Word2Vec [13], a widely used word 
embedding method, to calculate the semantic similarity 
between words. Word2Vec’s technique [11] transforms the 
semantic context, such as proximity between words, into a 
numeric vector space. In this way, linguistic regularities 
and patterns are encoded into linear translations. For 
example, using outputs from Word2Vec, relationships 
between words can be estimated by simple operations on 
their corresponding vectors, e.g., Madrid - Spain + France 
= Paris, or Germany + capital = Berlin [13]. 


Measures from these computational semantic similarity tools 
are powerful because they can provide an automated method 
for evaluation of partial word knowledge. However, they 
typically produce a single measure (e.g., cosine similarity or 
Euclidean distance), representing semantic similarity as a 
one-dimensional construct. With such a measure, it is not 
possible to represent partial semantic knowledge
and changes in knowledge of latent semantic features as
word knowledge progresses from unknown to frontier to
fully known. In the following sections, we describe how we
address this problem, using novel methods to estimate
the contribution of Osgood semantic features to individual
word meanings. 
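To make the limitation concrete, the sketch below computes the two single-number measures named above for a pair of word vectors. The three-dimensional vectors are toy stand-ins for high-dimensional embeddings and are purely hypothetical; the point is that each measure collapses the comparison into one value and does not say which dimension of meaning differs.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between vectors u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy vectors standing in for a target word and a student's response.
target = [1.0, 0.0, 1.0]
response = [1.0, 1.0, 0.0]

sim = cosine_similarity(target, response)
dist = euclidean_distance(target, response)
```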


2.1 Overview of the Study 


Based on findings from existing studies, this study
proposes an automated method for evaluating students’
partial knowledge of vocabulary that can be used to predict
students’ short-term vocabulary acquisition and long-term
retention. To investigate this problem, we address the
following research questions in this paper.


RQ1: Can semantic similarity scores from Word2Vec be used
to predict students’ short-term learning and long-term
retention? Previous studies of vocabulary tutoring systems
tend to focus on how different experimental conditions,
such as different spacing between
question items [18], difficulty levels [17], and systematic
feedback [7], affect students’ short-term learning. This study
examines how computationally estimated trial-by-trial
scores in a vocabulary tutoring system can be used to predict 
students’ short-term learning and long-term retention. 


RQ2: Compared to using regular Word2Vec scores, how does 
the model using Osgood’s semantic scales [16] as features 
perform for immediate and delayed learning prediction 
tasks? As described in the previous section, the raw
output of Word2Vec uses hundreds of
dimensions to represent the semantic characteristics of
a word. Summary statistics for comparing such high-dimensional
vectors, such as cosine similarity or Euclidean
distance, only provide the overall similarity between words.
If models using measures from Osgood scales perform at a level
similar to models using regular Word2Vec scores for predicting
students’ learning outcomes, we can argue that the Osgood scales
are an effective method for representing students’ partial
knowledge of vocabulary. 


3. METHOD 
3.1 Word Learning Study 


This study used a vocabulary tutoring system called 
Dynamic Support of Contextual Vocabulary Acquisition 
for Reading (DSCoVAR) [8]. DSCoVAR aims to support
efficient and effective vocabulary learning in context. All
participants accessed DSCoVAR in a classroom setting,
using Chromebook devices or the school’s
computer lab in the presence of other students. 


3.1.1 Study Participants 

Participants included 280 middle school students (6th to 
8th grade) from multiple schools, including children from 
diverse socio-economic and educational backgrounds. Table
1 provides a summary of student demographics, including
location (P1 or P2), age and grade level, and sex. Location P1 is
a laboratory school affiliated with a large urban university in
the northeastern United States. Students from location P1
were generally of high socio-economic status (e.g., children
of university faculty and staff). Location P2 includes three
public middle schools in a southern metropolitan area of the 
United States. All students from location P2 qualified for 
free or reduced lunch. The study included a broad range of 
students so that the results of this analysis were more likely 
to generalize to future samples. 


3.1.2 Study Materials 

DSCoVAR presented students with 60 SAT-level English
words (also known as Tier 2 words). These “target words,”
lesser-known words that the students were going to learn,
were balanced between different parts of speech, including 20
adjectives, 20 nouns, and 20 verbs. Based on previous work,
we expected that students would have varying degrees of
familiarity with the words at pre-test, but that most words
would be either completely novel (“unknown”) or somewhat
familiar (“partially known”) [8, 15]. This selection of
materials ensured that there would be variability in word
knowledge across students for each word and across words
for each student.


Table 1: The number of participants by grade and
gender


      6th grade    7th grade    8th grade
P1    16    28     19    23     18    13
P2    53    51     12     6     21    20


In DSCoVAR, students learned how to infer the meaning 
of an unknown word in a sentence by using surrounding 
contextual information. Having more information in a 
sentence (i.e., a sentence with a high degree of contextual 
constraint) can decrease the uncertainty of inference. In 
this study, the degree of sentence constraint was determined 
using standard cloze testing methods: quantifying the
diversity of responses from 30 human judges when the target
word is replaced by a blank to be filled in.
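One way to quantify the diversity of cloze responses is the Shannon entropy of the response distribution. This is a sketch under that assumption: the text does not state which diversity statistic was used, and the example responses are hypothetical (and fewer than the 30 judges in the study).

```python
import math
from collections import Counter


def response_entropy(responses):
    """Shannon entropy (in bits) of the distribution of cloze responses."""
    counts = Counter(r.strip().lower() for r in responses)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical judge responses for two sentences with a blanked-out target word.
high_constraint = ["education"] * 8  # all judges agree
low_constraint = ["job", "car", "education", "degree",
                  "friend", "meal", "grade", "life"]  # all judges differ

h_high = response_entropy(high_constraint)
h_low = response_entropy(low_constraint)
```

Lower entropy means the judges converge on the same word, which corresponds to a sentence with a high degree of contextual constraint.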


3.1.3 Study Protocol 


The word learning study comprised four parts: (1) a pre- 
test, which was used to estimate baseline knowledge of 
words, (2) a training session, where learners were exposed to 
words in meaningful contexts, (3) an immediate post-test, 
and (4) a delayed post-test, which occurred approximately 
one week after training. 


Pre-test. The pre-test session was designed to measure 
the students’ prior knowledge of the target words. For 
each target word, students were asked to answer two types 
of questions: familiarity-rating questions and synonym-selection
questions. In familiarity-rating questions, students
provided their self-rated familiarity levels (unknown, known, 
and familiar) for presented target words. In synonym- 
selection questions, students were asked to select a synonym 
word for the given target word from five multiple choice 
options. The outcome from synonym-selection questions 
provided more objective measures for students’ prior domain 
knowledge of target words. 


Training. Approximately one week after the pre-test 
session, students participated in the training. During 
training, students learned strategies to infer the meaning 
of an unknown word in a sentence by using surrounding 
contextual information. 


A training session consisted of two parts: an instruction 
video and practice questions. In the instruction video, 
students saw an animated movie clip about how to identify 
and use contextual information from the sentence to infer 
the meaning of an unknown word. In the practice question 
part, students could exercise the skill that they learned from 
the video. DSCoVAR provided sentences that included a 
target word with different levels of surrounding contextual 
information. The amount of contextual information for 
each sentence was determined by external crowd workers 
(details described in Section 3.1.2). In the practice question 
part, each target word was presented four times within
different sentences. Students were asked to type a synonym
of the target word, which was presented in the sentence as 
underlined and bold. Over two weeks, students participated 
in two training sessions with a week’s gap between them. 
Each training session contained the instruction video and 
practice questions for 30 target words. An immediate post- 
test session followed right after each training session. 


Figure 1: An example of a training session question. 
In this example, the target word is “education” with 
a feedback message for a high-accuracy response. 


[Screenshot of a training question. Sentence: “I go to school because I want to get a good education.”
Prompt: “Please enter ONE word that has the same meaning as the word education.” Response entered: “school”.
Hint: “If you do not know the answer, make your best guess. If you can't think of an exact synonym, enter a
word with a closely related meaning.”]


Students were randomly assigned to different
instruction video conditions (full instruction video vs.
restricted instruction video). Additionally, various difficulty
level conditions and feedback conditions (e.g., DSCoVAR
provides a feedback message to the student based on answer
accuracy vs. no feedback) were tested within the same
student. However, in this study, we focused on data
from students who experienced a full instruction video
and repeating difficulty conditions. Repeating difficulty
conditions included questions with all high or medium
contextual constraint levels. By doing so, we wanted to
minimize the impact of the various experimental conditions
on the analysis of post-test outcomes. Moreover, we filtered out
response sequences that did not include all four responses
for the target word. As a result, we analyzed 818 response 
sequences from 7,425 items in total. 


Immediate and Delayed Post-test. The immediate 
post-test occurred right after the students finished the 
training; the delayed post-test was conducted one week later. 
Data collected during the immediate and delayed post-
tests were used to estimate short- and long-term learning,
respectively. Test items were identical to those in the pretest
session, except for item order, which varied across tests. For 
analysis of the delayed post-test data, we only used the data 
from target words for which the student provided a correct 
answer in the earlier, immediate post-test session. As a 
result, 449 response sequences were analyzed for predicting 
the long-term retention. 
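The two filtering steps described in this section can be sketched as follows. This is a minimal illustration, not the authors' code: the record fields (`responses`, `immediate_correct`) and the toy records are hypothetical.

```python
# Hypothetical records: one per (student, target word) response sequence.
sequences = [
    {"student": "s1", "word": "w1", "responses": ["a", "b", "c", "d"], "immediate_correct": True},
    {"student": "s1", "word": "w2", "responses": ["a", "b"], "immediate_correct": True},
    {"student": "s2", "word": "w1", "responses": ["a", "b", "c", "d"], "immediate_correct": False},
]


def complete_sequences(records, n_trials=4):
    """Keep only sequences that contain a response for every training trial."""
    return [r for r in records if len(r["responses"]) == n_trials]


def retention_candidates(records):
    """Keep only sequences whose target word was answered correctly
    in the immediate post-test."""
    return [r for r in records if r["immediate_correct"]]


immediate_set = complete_sequences(sequences)
delayed_set = retention_candidates(immediate_set)
print(len(immediate_set), len(delayed_set))
```

The delayed post-test set is always a subset of the immediate post-test set, which is why the second analysis works with fewer response sequences.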


3.2 Semantic Score-Based Features 
We now describe the semantic features tested in our 
prediction models. 


3.2.1 Semantic Scales 

For this study, we used semantic scales from Osgood’s study
[16]. A cognitive psychologist selected ten scales as
semantic attributes that can be detected during word
learning (Figure 2). Each semantic scale consists of a pair
of semantic attributes. For example, the bad-good scale
shows how the meaning of a word can be projected on a
scale with bad and good located at either end. The word’s
relationship with each semantic anchor can be automatically
measured from its semantic similarity with these exemplar
semantic elements.


Figure 2: Ten semantic scales used for projecting
target words and responses [16].


• bad-good
• passive-active
• powerful-helpless
• big-small
• helpful-harmful
• complex-simple
• fast-slow
• noisy-quiet
• new-old
• healthy-sick


3.2.2 Basic Semantic Distance Scores 

To extract meaningful semantic information, we
applied the following measures, which can be used to explain
various characteristics of student responses for different
target words. In this study, we used a pre-trained Word2Vec
model,¹ built on the Google News corpus (100 billion tokens
with a vocabulary of 3 million unique words, trained
using a negative sampling algorithm), to measure semantic
similarity between words. The output of the pre-trained
Word2Vec model was a numeric vector with 300
dimensions.


First, we calculated the relationship between word pairs (i.e., 
a single student response and the target word, or a pair of 
responses) in both the regular Word2Vec (W2V) score and 
the Osgood semantic scale (OSG) score. In the W2V score, 
the semantic relationship between words was represented 
with a cosine similarity between word vectors: 


$$D_{w2v}(w_1, w_2) = 1 - \left|\mathrm{sim}(V(w_1), V(w_2))\right| \tag{1}$$


In this equation, the function $V$ returns the vectorized
representation of a word ($w_1$ or $w_2$) from the pre-trained
Word2Vec model. By calculating the cosine similarity
between the two vectors (the cosine similarity function is
denoted sim), we could extract a single numeric similarity score
between two words. This score was converted into a
distance-like score by taking the absolute value of the cosine
similarity score and subtracting it from one.
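As a minimal sketch of equation 1, the distance can be computed as below. The three-dimensional vectors are made-up stand-ins for real Word2Vec vectors (which have 300 dimensions), and the function names are ours.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def w2v_distance(vec1, vec2):
    """Equation 1: one minus the absolute cosine similarity."""
    return 1.0 - abs(cosine_similarity(vec1, vec2))


# Toy vectors (hypothetical values, for illustration only).
TOY_VECTORS = {
    "education": [0.9, 0.1, 0.3],
    "school": [0.8, 0.2, 0.4],
    "banana": [-0.1, 0.9, -0.2],
}

d_close = w2v_distance(TOY_VECTORS["education"], TOY_VECTORS["school"])
d_far = w2v_distance(TOY_VECTORS["education"], TOY_VECTORS["banana"])
print(d_close < d_far)
```

Because of the absolute value, two vectors pointing in exactly opposite directions also receive a distance of zero; only near-orthogonal vectors approach the maximum distance of one.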


For the OSG score, we extracted two different types of 
scores: a non-normalized score and a normalized score. A 
non-normalized score showed how a word is similar to a 
single anchor word (e.g., bad or good) from the Osgood scale. 


$$S_{osg}(w, a_{i,j}) = \mathrm{sim}(V(w), V(a_{i,j})) \tag{2}$$


$$D_{osg}(w_1, w_2; a_{i,j}) = \left|S_{osg}(w_1, a_{i,j})\right| - \left|S_{osg}(w_2, a_{i,j})\right| \tag{3}$$


In equation 2, $a_{i,j}$ represents a single anchor word ($j$) in
the $i$-th Osgood scale. The similarity between the anchor
word and the evaluated word $w$ was calculated as the cosine
similarity of the Word2Vec vectors for both words. In the
non-normalized setting, the distance between two words with
respect to a particular anchor word was calculated as the difference
of their absolute cosine similarity scores (equation 3).
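The non-normalized score and distance (equations 2 and 3) can be sketched as follows. The two-dimensional vectors are invented for illustration, and the function names are ours rather than the authors'.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def osg_score(word_vec, anchor_vec):
    """Equation 2: similarity between a word and a single anchor word."""
    return cosine_similarity(word_vec, anchor_vec)


def osg_distance(word1_vec, word2_vec, anchor_vec):
    """Equation 3: difference of the absolute similarities to one anchor."""
    return abs(osg_score(word1_vec, anchor_vec)) - abs(osg_score(word2_vec, anchor_vec))


# Hypothetical vectors for the two anchors of one scale and two words.
bad, good = [-1.0, 0.2], [1.0, 0.1]
target, response = [0.8, 0.3], [0.2, 0.9]

# One value per anchor gives a two-dimensional description of the pair.
point = (osg_distance(target, response, bad), osg_distance(target, response, good))
print(point)
```

Unlike equation 1, this quantity is signed: swapping the two words flips the sign.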


The second type of OSG score is a normalized score. By
using Word2Vec’s ability to do arithmetic on
multiple word vectors, the normalized OSG score provided
a relative location of the word between the two anchor ends of the
Osgood scale.


¹ The API and pre-trained model for Word2Vec were downloaded
from this URL: https://github.com/3Top/word2vec-api


$$S_{osg}(w, a_i) = \mathrm{sim}(V(w), V(a_{i,1}) - V(a_{i,2})) \tag{4}$$


$$D_{osg}(w_1, w_2; a_i) = \left|S_{osg}(w_1, a_i) - S_{osg}(w_2, a_i)\right| \tag{5}$$


In equation 4, the output represents the cosine similarity
score between the word $w$ and the two anchor words ($a_{i,1}$
and $a_{i,2}$). For example, if the cosine similarity score
$S_{osg}(w, a_i)$ is close to -1, it means the word $w$ is close to
the first anchor word $a_{i,1}$. If the score is close to 1, the
reverse holds. In equation 5, the distance between two words was
calculated as the absolute value of the difference between
the two cosine similarity measures.
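A minimal sketch of the normalized score and distance (equations 4 and 5) follows. The vectors are invented for illustration and the function names are ours.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def normalized_osg_score(word_vec, anchor1_vec, anchor2_vec):
    """Equation 4: similarity between a word and the difference of the anchor vectors."""
    scale_axis = [a - b for a, b in zip(anchor1_vec, anchor2_vec)]
    return cosine_similarity(word_vec, scale_axis)


def normalized_osg_distance(word1_vec, word2_vec, anchor1_vec, anchor2_vec):
    """Equation 5: absolute difference between two normalized scores."""
    s1 = normalized_osg_score(word1_vec, anchor1_vec, anchor2_vec)
    s2 = normalized_osg_score(word2_vec, anchor1_vec, anchor2_vec)
    return abs(s1 - s2)


# Hypothetical vectors for one scale's two anchors and two words.
anchor1, anchor2 = [-1.0, 0.2], [1.0, 0.1]
target, response = [0.8, 0.3], [0.2, 0.9]

d = normalized_osg_distance(target, response, anchor1, anchor2)
print(round(d, 3))
```

Each scale contributes a single distance here, in contrast to the two anchor-wise values of the non-normalized setting.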


3.2.3 Deriving Predictive Features

Based on semantic distance equations explained in the 
previous section, this section explains examples of predictive 
features that we used to predict students’ short-term 
learning and long-term retention. 


Distance Between the Target Word and the
Response. For regular Word2Vec score models and Osgood
scale score models, distance measures between the target 
word and the response (by using equations 1 and 5) were 
used to estimate the accuracy of the response to a question. 
This feature represents the trial-by-trial answer accuracy of 
a student response. Each response sequence for the target 
word contained four distance scores. 


Difference Between Responses. Another feature that 
we used in both types of models was the difference between 
responses. This feature could capture how a student’s 
current answer is semantically different from the previous 
response. From each response sequence, we could extract 
three derivative scores from four responses. 
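The two features described so far can be sketched as below, using the Word2Vec distance of equation 1 as the distance function. This is an illustration under assumptions: the vectors are invented, and treating the "difference between responses" as the distance between consecutive responses is our reading of the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def w2v_distance(u, v):
    """Equation 1: one minus the absolute cosine similarity."""
    return 1.0 - abs(cosine_similarity(u, v))


def target_distances(target_vec, response_vecs, distance=w2v_distance):
    """One distance per trial between the target word and that trial's response."""
    return [distance(target_vec, r) for r in response_vecs]


def response_differences(response_vecs, distance=w2v_distance):
    """Distance between each response and the response before it."""
    return [distance(prev, cur) for prev, cur in zip(response_vecs, response_vecs[1:])]


# Hypothetical target and four responses that move toward the target.
target = [1.0, 0.0]
responses = [[0.0, 1.0], [0.5, 0.5], [0.9, 0.1], [1.0, 0.0]]

dist_features = target_distances(target, responses)
resp_features = response_differences(responses)
print(len(dist_features), len(resp_features))
```

Four responses yield four target distances and three consecutive differences, matching the counts given in the text.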


Convex Hull Area of Responses. As an alternative to
the difference between responses feature, Osgood scale
models were also tested with the area of the convex hull
generated by the responses, calculated with
non-normalized Osgood scale scores (equation 3). For
each Osgood scale, a non-normalized score provided
two-dimensional scores that can be used for geometric
representation. By placing the target word at the origin,
a sequence of responses creates a polygon
that represents the semantic area that the student
explored with responses. Since some response sequences
could not generate a polygon because they included fewer than
three unique responses, we added a small, random noise
that uniformly distributed (between —10~* and 10~*) to all 
response points. Additionally, a value of 107 °° was added to 
all convex hull area output to create a visible lower-bound 
value. 


Unlike the measure of difference between responses, this
feature also considers angles that can be created between
responses and the target word. This representation can
provide more information than just using difference between
responses.
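A convex hull area feature of this kind could be computed as in the sketch below. Several choices here are assumptions rather than details from the text: the hull is built over the response points only, the hull algorithm (Andrew's monotone chain) and the shoelace area formula are our choice, the `noise_scale` value is illustrative, and the additive lower-bound constant is left out.

```python
import random


def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula; returns 0.0 for fewer than three vertices."""
    n = len(vertices)
    if n < 3:
        return 0.0
    total = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def hull_area_feature(points, noise_scale, rng):
    """Jitter every point with uniform noise, then return the hull area."""
    jittered = [
        (x + rng.uniform(-noise_scale, noise_scale), y + rng.uniform(-noise_scale, noise_scale))
        for x, y in points
    ]
    return polygon_area(convex_hull(jittered))


# Unit square with one interior point: the hull ignores the interior point.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
area_square = polygon_area(convex_hull(square))

# Four identical responses: jitter makes the points distinct.
repeated = [(0.3, 0.3)] * 4
area_repeated = hull_area_feature(repeated, noise_scale=1e-3, rng=random.Random(0))
print(area_square, area_repeated >= 0.0)
```

The jitter guarantees that repeated responses no longer coincide exactly, so a (very small) polygon can usually be formed instead of a degenerate point or line.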


3.3 Modeling


To predict students’ short-term learning and long-term
retention, we used a mixed-effect logistic regression model
(MLR). MLR is a general form of logistic regression model
that includes random effect factors to capture variations
from repeated measures.


3.3.1 Off-line Variables 


Off-line variables capture item- or subject-level variances 
that can be observed repeatedly from the data. In this study, 
we used multiple off-line variables as random effect factors. 


First, results from familiarity-rating and synonym-selection 
questions from the pre-test session were used to include 
item- and subject-level variances. Both variables include 
information on the student’s prior domain knowledge level 
for target words. Second, the question difficulty condition 
was considered as an item group level factor. In the analysis, 
sentences for the target word that were presented to the 
student contained the same difficulty level, either high or 
medium contextual constraint levels, over four trials. Third, 
the experiment group was used as a subject group
factor. As described in Section 3.1.1, this study contains
data from students in different institutions in separate
geographic locations. The inclusion of these participant
groups in the model can help explain differences in
short-term learning outcomes and long-term retention across
demographic groups.


3.3.2 Model Building 


In this study, we compared the performance of MLR models 
with four different feature types. First, the baseline model 
was set to indicate the MLR model’s performance without 
any fixed effect variables but only with random intercepts. 
Second, the response time model was built to be compared 
with semantic score-based models. Many previous studies 
have used response time as an important predictor of student 
engagement and learning [2, 12]. In this study, we used two 
types of response time variables, the latency for initiating 
the response and finishing typing the response, as predictive 
features. Both variables were measured in milliseconds over 
four trials and natural log transformed for the analysis. 
Third, semantic features from regular Word2Vec scores were
used as predictors. This model was built to show how
semantic scores from Word2Vec can be useful for predicting
students’ short- and long-term performance in DSCoVAR.
Lastly, Osgood scale-based features were used as predictors. 
This model was compared with the regular Word2Vec score 
model to examine the effectiveness of using Osgood scales for 
evaluating students’ performance in DSCoVAR. For these 
semantic-score based models, we tested out different types 
of predictive features that were described in Section 3.2.3. 
All models shared the same random intercept structure 
that treated each off-line variable as an individual random 
intercept. 
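One way to write down the shared model structure is sketched below. The notation is ours and is an assumption, not taken from the original: $y_n$ is the binary post-test outcome for response sequence $n$, $x_{n,k}$ are the fixed-effect features of the model in question, $u^{(g)}$ is the random intercept for off-line variable $g$, and $\ell_g(n)$ is the level of that variable for sequence $n$.

```latex
\operatorname{logit} \Pr(y_n = 1)
  = \beta_0 + \sum_{k=1}^{K} \beta_k x_{n,k}
  + \sum_{g=1}^{G} u^{(g)}_{\ell_g(n)},
\qquad
u^{(g)}_{\ell} \sim \mathcal{N}\!\left(0, \sigma_g^2\right)
```

Under this reading, the baseline model corresponds to $K = 0$ (random intercepts only), and the response time, Word2Vec, and Osgood models differ only in which features $x_{n,k}$ enter the fixed-effect sum.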


For Osgood scale models, we also derived reduced-scale 
models. Reduced-scale models were compared with the full- 
scale model, which uses all ten Osgood scales. In this case, 
using fewer Osgood scales can provide easier interpretation 
of semantic analysis for intelligent tutoring system users. 


3.3.3 Model Evaluation 


To compare performance between different models, this
study used various evaluation metrics, including AUC (the
area under the receiver operating characteristic (ROC)
curve), F1 (the harmonic mean of precision and recall),
and error rate (the ratio of the number of
misclassified items to total items). The 95% confidence interval
of each evaluation metric was calculated from the outcome of
a ten-fold cross-validation process repeated ten times.
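The three metrics can be computed from scratch as in the sketch below. The labels and scores are toy values, not data from the study, and the AUC is computed with the pairwise (Mann-Whitney) formulation, which is our choice of method.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_score(labels, preds):
    """Harmonic mean of precision and recall (0.0 if there are no true positives)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def error_rate(labels, preds):
    """Share of misclassified items."""
    return sum(1 for y, p in zip(labels, preds) if y != p) / len(labels)


# Toy example: true outcomes and predicted probabilities.
labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.6, 0.7, 0.2, 0.4]
preds = [1 if s >= 0.5 else 0 for s in scores]

auc = auc_score(labels, scores)
f1 = f1_score(labels, preds)
err = error_rate(labels, preds)
print(round(auc, 3), round(f1, 3), round(err, 3))
```

AUC is threshold-free, whereas F1 and error rate depend on the cutoff used to turn predicted probabilities into class labels.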


To select the semantic score-based features for models based
on regular Word2Vec scores and Osgood scale scores, we
used rankings from each evaluation metric. The model with
the best overall rank (i.e., the lowest rank-sum value after
summing the ranks from AUC, F1, and error rate)
was considered the best-performing model for
the score type (i.e., models based on the regular Word2Vec
score or Osgood scale score). More details on this process
are given in the next section.
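The rank-sum selection rule can be sketched as follows. The model names and metric values are invented for illustration, and the tie-handling rule (tied models share the better rank) is an assumption, since the text does not specify one.

```python
def rank_values(values, higher_is_better):
    """Rank models on one metric (1 = best); tied models share the better rank."""
    ranks = {}
    for name, value in values.items():
        if higher_is_better:
            better = sum(1 for other in values.values() if other > value)
        else:
            better = sum(1 for other in values.values() if other < value)
        ranks[name] = better + 1
    return ranks


def select_by_rank_sum(metrics):
    """Sum the AUC, F1, and error-rate ranks; the lowest rank-sum wins."""
    auc_ranks = rank_values({m: v["auc"] for m, v in metrics.items()}, higher_is_better=True)
    f1_ranks = rank_values({m: v["f1"] for m, v in metrics.items()}, higher_is_better=True)
    err_ranks = rank_values({m: v["err"] for m, v in metrics.items()}, higher_is_better=False)
    rank_sums = {m: auc_ranks[m] + f1_ranks[m] + err_ranks[m] for m in metrics}
    best = min(rank_sums, key=rank_sums.get)
    return best, rank_sums


# Hypothetical metric values for three candidate feature sets.
metrics = {
    "model_a": {"auc": 0.70, "f1": 0.75, "err": 0.31},
    "model_b": {"auc": 0.72, "f1": 0.76, "err": 0.29},
    "model_c": {"auc": 0.68, "f1": 0.74, "err": 0.33},
}

best, rank_sums = select_by_rank_sum(metrics)
print(best, rank_sums)
```

Note that error rate is ranked in the opposite direction from AUC and F1, because a lower error rate is better.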


4. RESULTS 
4.1 Selecting Models 


In this section, we selected the best-performing model based
on the models’ overall ranks in each evaluation metric. All
model parameters were trained in each fold of repeated
cross-validation. We calculated 95% confidence intervals for
comparison. To calculate the confidence intervals of the F1 and
error rate measures, the maximum (F1) and minimum (error
rate) scores of each fold were extracted. These maximum
and minimum values were derived by applying multiple
cutoff points to the mixed-effect regression model.
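Extracting the per-fold maximum F1 and minimum error rate over several cutoff points can be sketched as below. The labels, predicted probabilities, and the grid of cutoffs are illustrative assumptions; the text does not say which cutoffs were used.

```python
def f1_and_error(labels, preds):
    """Return (F1, error rate) for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)
    err = sum(1 for y, p in zip(labels, preds) if y != p) / len(labels)
    return f1, err


def best_over_cutoffs(labels, probs, cutoffs):
    """Apply every cutoff and keep the maximum F1 and the minimum error rate."""
    results = [
        f1_and_error(labels, [1 if p >= c else 0 for p in probs]) for c in cutoffs
    ]
    return max(f for f, _ in results), min(e for _, e in results)


# Toy fold: true outcomes and predicted probabilities.
labels = [1, 1, 0, 0, 1]
probs = [0.9, 0.65, 0.75, 0.25, 0.45]
cutoffs = [i / 10 for i in range(1, 10)]

max_f1, min_err = best_over_cutoffs(labels, probs, cutoffs)
print(round(max_f1, 3), round(min_err, 3))
```

The maximum F1 and the minimum error rate need not come from the same cutoff, although they do in this toy fold.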


4.1.1 Predicting Immediate Learning 

First, we built models that predict the students’ immediate 
learning from the immediate post-test session. From 
models based on regular Word2Vec scores (W2V), the model 
with the distance between the target and responses and 
the difference between responses (Dist+Resp) provided the 
highest rank from various evaluation metrics (Table 2). 
From models based on Osgood scales (OSG), the model with 
the difference between responses (Resp) provided the highest 
rank. 


The selected W2V model provided significantly better 
performance than the baseline model. The selected OSG 
model also showed significantly better performance than the 
baseline model, except for the AUC score. The selected 
W2V model was significantly better than the model using 
response time features in the AUC score and error rates. 


The selected W2V model showed significantly better 
performance than the OSG model only with the AUC score. 
Figure 3 shows that the W2V model has a slightly larger area 
under the ROC curve than the OSG model. In the precision 
and recall curve, the selected W2V model provides more 
balanced trade-offs between precision and recall measures. 
The selected OSG model outperforms the W2V model in 
precision only in a very low recall measure range. 


4.1.2 Predicting Long-Term Retention 

We also built prediction models to predict the students’ 
long-term retention in the delayed post-test session. In 
this analysis, a student response was included only when 
the student provided correct answers to the immediate 
post-test session questions. Among W2V score-based
models, the best-performing model contained the same
feature types as the immediate post-test results (Table 3).
By using the distance between the target and responses
and difference between responses (Dist+Resp), the model
achieved significantly better performance than the baseline
model, except for the AUC score.


For OSG models, the model with a convex hull area of 
responses (Chull) provided the highest overall rank from 
evaluation metrics (Table 3). The results were significantly 
better than the baseline model, and marginally better than 
the W2V model. Both selected W2V and OSG models were 
marginally better than the response time model, except the 
error rate of the OSG model was significantly better. 


In Figure 3, the selected W2V model slightly outperforms 
the OSG model in mid-range true positive rates, while 
the OSG model performed slightly better in a higher true 
positive area. Precision and recall curves show similar 
patterns to those we observed from the immediate post-test 
prediction models. The OSG model only outperforms the 
W2V model in a very low recall value area. 


4.1.3 Comparing Models 

Compared to the selected W2V model in the immediate
post-test condition, the selected W2V model in the delayed
post-test retention condition showed a significantly lower
AUC score, marginally higher F1 score, and marginally
higher error rate. In terms of OSG models, the selected OSG
model for delayed post-test retention showed a significantly
better F1 score and error rate than the selected OSG model
in the immediate post-test condition. Based on these results,
we can argue that Osgood scale scores can be more useful for
predicting student retention in the delayed post-test session
than predicting the outcome from the immediate post-test.


In terms of selected feature types, the best-performing 
OSG models used features based on the difference between 
responses (Resp) or the convex hull area (Chull) that was 
created from the relative location of the responses. On the 
other hand, selected W2V models used both the distance 
between the target word and responses and difference 
between responses (Dist+Resp). When we compared 
both W2V and OSG models using the difference between 
responses feature, we found that performance is similar in 
the immediate post-test data. However, the OSG model 
was significantly better in the delayed post-test data. These 
results show that Osgood scale scores can be more useful for 
representing the relationship among response sequences. 


4.2 Comparing the Osgood Scales 

To identify which Osgood scales are more helpful than 
others for predicting students’ performance, we conducted 
a scale-wise importance analysis. The results from this 
section reveal which Osgood scales are more important than 
others, and how the performance of prediction models with 
a reduced number of scales is comparable with the full-scale 
model. 


4.2.1 Identifying More Important Osgood Scales 

In this section, based on the selected Osgood score model 
from Section 4.1, we identified the level of contribution for 
features based on each Osgood scale. For example, the 
selected OSG model for predicting the immediate post-test 
data uses the difference between responses in ten Osgood 
scales as features. In order to diagnose the importance level 
of the first scale (bad-good), we can retrain the model with 
features based on the nine other scales and compare the


Table 2: Ranks of predictive feature sets for regular Word2Vec models (W2V) and Osgood score models
(OSG) in the immediate post-test data. All models are significantly better than the baseline model. (Bold:
the selected model with highest overall rank.)
W2V models 


baseline 0.68 |0.67, 
RT 0.68, 
Dist 0.72 [0.71, 0.74] (1) 


0.76 
Resp 0.70 [0.69, 0.71] (3) |0.75 
Chull NA NA 


5) [0.33 [0.33, 0. 67, 0.69] (5 
(0.75, 0.76] (3) |0.31 [0.31, 0.32] (4) |0.69 [0.68, 0.70] (2) 
(0.75, 0.76] (2) |0.29 [0.28, 0.30] (2) |0.67 [0.66, 0.68] (7) |0.73 [0.73, 0.74] (7) |0.33 [0.32, 0.34] (6) 
(0.74, 0.76] (4) |0.31 [0.30, 0.32] (3) 


OSG models 
0.67, 0.69 5 0.33 |0.33, 0.34] (7 
[0.74, 0.76] (2) |0.31 [0.31, 0.32] (2) 


0.74 [0.73, 0.75] (4) |0.32 [0.31, 0.33] (4) 
0.74 [0.73, 0.75] (3) {0.31 [0.31, 0.32] (3) 


NA NA 


0.74 [0.73, 0.74] (6) |0.33 [0.32, 0.34] (5) 


Table 3: Ranks of predictive feature sets for W2V and OSG models in the delayed post-test data. All models 
are significantly better than the baseline model. (Bold: the selected model with highest overall rank.) 


W2V models 


0.75 |0.74, 0.76] (5 


OSG models 


0.75 0.74, 0.76 


64, 0.67] (5) (0. 0.33 [0.32, 0. 64, 0.67] (5) 0. 7) |0.33 [0.32, 0.34] (7 
0.67 [0.65, 0.68] (3) |0.76 [0.76, 0.77] (4) 0.31 [0.30, 0.32] (3) |0.67 [0.65, 0.68] (3) |0.76 [0.76, 0.77] (5) |0.31 [0.30, 0.32] (5) 
0.66 [0.64, 0.68] (4) |0.77 [0.76, 0.78] (3) 0.31 [0.30, 0.32] (4) |0.66 [0.64, 0.68] (4) |0.78 [0.77, 0.79] (3) |0.30 [0.29, 0.31] (3) 


0.69 [0.67, 0.71] (1) |0.77 [0.76, 0.78] (2) 0.30 [0.29, 0.31] (2) |0.63 [0.61, 0.65] (7) |0.76 [0.75, 0.77] (6) |0.32 [0.31, 0.33] (6) 
NA NA NA 


Dist+Chull|NA NA NA 


0.31 [0.29, 0.32] (4) 


performance of the newly trained model with the existing 
full-scale model. 
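The leave-one-scale-out procedure can be sketched as follows. This is a simplified illustration: `toy_evaluate` is a hypothetical stand-in for retraining and cross-validating the model, the contribution values are invented, and a single score is used here where the study combines ranks from several metrics.

```python
def scale_importance(scales, evaluate):
    """Score drop when each scale is left out.

    `evaluate` maps a list of scales to a single performance score.
    """
    full_score = evaluate(scales)
    importance = {}
    for scale in scales:
        reduced = [s for s in scales if s != scale]
        importance[scale] = full_score - evaluate(reduced)
    return importance


# Hypothetical per-scale contributions, used only to make the sketch runnable.
TOY_CONTRIBUTION = {"bad-good": 0.03, "big-small": 0.02, "new-old": 0.0}


def toy_evaluate(scales):
    """Stand-in for model retraining: a base score plus invented contributions."""
    return 0.65 + sum(TOY_CONTRIBUTION[s] for s in scales)


importance = scale_importance(list(TOY_CONTRIBUTION), toy_evaluate)
ranked = sorted(importance, key=importance.get, reverse=True)
print(ranked)
```

A scale whose removal causes the largest drop in performance is treated as the most important one.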


In Table 4, we picked the top five scales that were 
important in individual prediction tasks. We found that big- 
small, helpful-harmful, complex-simple, and fast-slow were 
commonly important Osgood scales for predicting students’ 
performance in immediate post-test and delayed post-test 
sessions. Scales like bad-good and passive-active were only 
important scales in the immediate post-test prediction. 
Likewise, new-old was an important scale only in the delayed 
post-test prediction. 


Table 4: Scale-wise importance of each Osgood 
scale. Scales were selected based on the sum of each 
evaluation metric’s rank. (Bold: Osgood scales that 
were commonly important in both prediction tasks; 
*: top five scales in each prediction task including 
tied ranks) 


Imm. post-test 
AUC | F1 | Err | All
bad-good 1 
passive-active 
powerful-helpless 
big-small 
helpful-harmful 
complex-simple 
fast-slow 
noisy-quiet 
new-old 
healthy-sick 


Del. post-test 
AUC | F1 | Err
4 10 


oO 
i=) 


FN ONTUDWHO KR 
FP OOCONNOUAB DWH 
COWOMOWNEFNF OK 


i) 
2 
7 
3 
4 
8 
5 
6 
9 
1 


OGNAWBNYEF KF OO 
NN OR ORF WOOD 


4.2.2 Performance of Reduced Models 


Based on the scale-wise importance analysis results, we built
reduced-scale models that only contain features from the more
important Osgood scales. The prediction performance of
reduced-scale models was similar to or marginally better than
that of the full-scale OSG models. For example, the OSG model for
predicting the immediate post-test outcome with the top
two scales (bad-good and passive-active) was marginally
better than the full-scale model (AUC: 0.71 [0.70, 0.72], F1:
0.76 [0.75, 0.77], error rate: 0.30 [0.29, 0.30]). Similar results
were observed for predicting retention in the delayed post-test
(selected scales: helpful-harmful, big-small) (AUC: 0.71
[0.69, 0.72], F1: 0.79 [0.78, 0.80], error rate: 0.28 [0.27,
0.29]). Although differences were small, the results indicate
that using a small number of Osgood scales can be similarly
effective to the full-scale model.


5. DISCUSSION AND CONCLUSIONS 


In this paper, we introduced a novel semantic similarity 
scoring method that uses predefined semantic scales to 
represent the relationship between words. By combining 
Osgood’s semantic scales [16] and Word2Vec [13], we could 
automatically extract the semantic relationship between 
two words in a more interpretable manner. To show that this
method can effectively represent students’ knowledge in
vocabulary acquisition, we built prediction models that can 
be used to predict the student’s immediate learning and 
long-term retention. We found that our models performed 
significantly better than the baseline and the response- 
time-based models. In the future, we believe results from 
using an Osgood scale-based student model could be used 
to provide a more personalized learning experience, such 
as generating questions that can improve an individual 
student’s understanding for specific semantic attributes. 


Based on our findings, we have identified the following
points for further discussion. First, in Section 4.1, we
found that models using Osgood scale scores perform
similarly to models using regular Word2Vec scores
for predicting students’ long-term retention of acquired
vocabulary. However, we think our models can be further
improved by incorporating additional features. For example,
non-semantic score-based features like response time and
orthographic similarity among responses can be useful
features for capturing different patterns of false predictions
of current models. Moreover, some general measures that
capture a student’s meta-cognitive or linguistic skills could
help explain why retention results differ even when
students provided the same response sequences. Similarly, in
Section 4.1.3, we found that Osgood scores can be a better
metric to characterize the relationship between responses
in terms of predicting students’ retention. A composite
model that uses both a regular Word2Vec score-based feature
(target-response distance) and an Osgood scale score-based
feature (response-response distance) may also provide better
prediction performance.


Figure 3: ROC curves and precision and recall curves for selected immediate post-test prediction models
(left) and delayed post-test prediction models (right). Curves are smoothed out with a local polynomial
regression method based on repeated cross-validation results.


Second, we found that models with a reduced number of 
Osgood scales performed marginally better than the full- 
scale model. However, differences were very small. Since 
this study only used some of the semantic scales from 
Osgood’s study [16], further investigation would be required 
to examine the validity of these scales, including other scales 
not used for this study, for capturing the semantic attributes 
of student responses during vocabulary learning. 


Also, there were some limitations in the current study 
and areas for future work. First, expanding the scope 
of analysis to the full set of experimental conditions 
used in the study may reveal more complex interactions 
between these conditions and students’ short- and long- 
term learning. Second, this study used a fixed threshold 
of 0.5 for investigating false prediction results. However, an 
optimal threshold for each participant group or prediction 
model could be selected, especially if there are different false 
positive or negative patterns observed for different groups 
of students. Lastly, this study collected data from a single 
vocabulary tutoring system that was used in a classroom 
setting. Applying the proposed method to data that was 
collected from a non-classroom setting or other vocabulary 
learning system would be useful to show the generalization 
of our suggested method.


ABSTRACT


We investigate generalizability of face-based detectors of mind 
wandering across task contexts. We leveraged data from two lab 
studies: one where 152 college students read a scientific text and 
another where 109 college students watched a narrative film. We 
automatically extracted facial expressions and body motion 
features, which were used to train supervised machine learning 
models on each dataset, as well as a concatenated dataset. We 
applied models from each task context (scientific text or narrative 
film) to the alternate context to study generalizability. We found 
that models trained on the narrative film dataset generalized to the 
scientific text dataset with no modifications, but the predicted mind 
wandering rate needed to be adjusted before models trained on the 
scientific text dataset would generalize to the narrative film dataset. 
Additionally, we analyzed generalizability of individual features 
and found that the lip tightener and jaw drop action units had the 
greatest potential to generalize across task contexts. We discuss 
findings and applications of our work to attention-aware learning 
technologies. 


Keywords 


Mind Wandering, Mental States, Attention Aware Interfaces, 
Cross-Corpus training. 


1. INTRODUCTION 


Consider a typical day when you were an undergraduate college 
student. Your first class is your favorite, so you are engaged in the 
lecture content and processing new information. In your next class, 
you watch a documentary about a subject that does not interest you, 
causing your attention to focus on unrelated thoughts of your social 
life, rather than processing the information in the video. Later, you 
work on a homework assignment that you find frustrating, leading 
to waning motivation. Towards the end of your day, you attend a 
chemistry lab, where you interact with a new educational game that 
teaches you the basics of chemical bonds. At some points you are 
enjoying the game, and thus engaged in deeply learning the content. 
However, you later become bored during a long period of repetitive 
gameplay, causing you to become distracted and miss important 
information. Throughout the day, your mental states (engagement, 
frustration, boredom) influenced your learning. Your learning
experience could have been augmented with technology that
responded to your changing mental state, thus assisting you in
achieving the most effective learning experience.


Educational interfaces that detect and respond to student mental
states are driven by work on cognitive and affective state modeling,
which has been investigated for many years. For example, attention
and affect have been modeled in educational tasks such as reading
comprehension [6, 16, 28] and computerized tutoring [3, 19],
among others. In general, there has been a plethora of work that has 
modeled a variety of mental states within specific educational tasks 
(e.g., [2, 15, 19]) to better understand these states and use that 
knowledge to facilitate student learning. 


However, prior research has overwhelmingly investigated single 
task contexts, and has overlooked generalizability to different 
contexts. For example, models that track attention during reading 
might not generalize to lecture viewing, educational gaming, and 
so on. This makes it difficult to decouple task-specific effects from 
more fundamental patterns. In contrast, models that successfully 
generalize across multiple contexts should reveal observable 
signals (e.g., eye gaze, facial features, and physiology data) that are
general, rather than task-specific. Models using such indicators will 
be key to developing adaptive technologies that are sensitive to 
student mental states and that can operate across a range of 
educational activities. 


We report results on modeling mental states in a generalized way 
using mind wandering (MW) as a case study. MW is a ubiquitous 
phenomenon where thoughts shift from task-related processing to 
task-unrelated thoughts [15]. MW is estimated to occur anywhere
from 20% to 50% of the time, depending on the person, task, and
environmental context [23]. It has also been associated with
lower performance on a variety of educational tasks, such as 
reading comprehension [16] and retention of lecture content [29], 
thus impacting student learning. 


As with work on other mental states, research on MW has largely 
failed to address models that generalize across contexts [6, 15]. 
MW detection has been investigated in reading comprehension [6, 
16], narrative and instructional film comprehension [25, 26], and 
student interaction with an intelligent tutoring system (ITS) [19]. 
To our knowledge, no work has investigated MW detection with 
the goal of generalizability across task contexts. 


We specifically investigate the generalizability of MW models 
across two task contexts: reading a scientific text and viewing a
narrative film. These contexts were chosen because of their broad 
applicability to education in the classroom and online. For example, 
a documentary film could be shown in a sociology course or 
distance learning students could read instructional texts prior to 
engaging in an online discussion.


1.1 Related Work


Cross corpus training has been researched in a variety of 
classification problems, such as sentiment analysis [31] and 
acoustic-based emotion recognition [35]. Cross corpus training 
seeks to improve robustness of machine-learned models by 
leveraging multiple datasets in classifier training and testing. For 
example, Webb and Ferguson [32] applied cross corpus training 
techniques to characterize the function of segments of dialogue 
using automatically extracted lexical and syntactic features called 
cue phrases. Each extracted cue phrase was used to classify a 
segment of dialogue. They trained separate classifiers on two 
different datasets, and applied the classifier to the dataset on which 
it was not trained. They found the cross-training results were
comparable to the results of training and testing on the same dataset
(e.g., the best cross-trained classifier achieved an accuracy of 71%,
compared to an accuracy of 81% when trained and tested on the
same dataset). Additionally, they examined generalizability of the
cue phrases across datasets by reducing the feature set to contain 
only cues present in both datasets. They found that reducing the 
feature set yielded slight improvements, and demonstrated the 
discriminative nature of a small number of features. 


Zhang et al. [35] similarly explored the use of multiple datasets for
creating context-generalizable models. They built classifiers for 
valence and arousal on highly varied emotional speech datasets 
using a leave-one-corpora-out cross-validation technique. 
Additionally, they explored methods for data normalization (within 
each dataset and between datasets) and agglomeration of both 
labeled and unlabeled data. They found that, of their six emotional 
speech corpora, training on some subsets yielded higher accuracy 
than others. Their work suggested that careful selection of corpora 
best suited for training might yield better emotional speech 
recognition performance than an all-or-nothing approach to cross- 
corpus training. 


Our work approaches cross-corpus modeling through detection of 
MW. A variety of studies have investigated MW detection during 
educational tasks, such as reading [15], interacting with an 
intelligent tutoring system (ITS) [19], or watching an educational 
video [26]. No work has focused on MW from a cross-corpus 
modeling perspective, to our knowledge, so we review the 
individual studies below. 


Detection of MW from eye gaze features while reading has been 
amply investigated. For example, Bixler and D’Mello [4] built 
models to detect MW while students read texts about scientific 
research methods. This work made use of probe-caught reports 
(students respond yes or no to auditory thought probes of whether 
they were MW), instead of self-caught reports (students report 
whenever they catch themselves MW). Their analysis of eye gaze 
features showed that certain types of fixations were longer during 
MW. Specifically, they found that longer gaze fixations 
(consecutive fixations on a single word), first-pass fixations 
(fixations on a word during the first pass through a text), and single 
fixations (fixations on a word only fixated on once) were predictive 
of MW. In other work, Bixler and D’Mello [5] similarly used eye 
gaze features, but used self-caught reports of MW. They found that 
a greater number of fixations, longer saccade length, and line cross 
saccades were indicative of MW. Across studies on MW detection 
during reading, longer fixations were found to be indicative of MW 
[4, 15, 28], suggesting these features might generalize well. 


Pham and Wang [26] similarly used consumer-grade equipment to 
detect MW while students watched videos from massive open 
online courses (MOOCs). They made use of heart rate, detected by 
monitoring fingertip blood flow using the back camera of a 
smartphone (i.e., photoplethysmography). Their models achieved a 
22% improvement over chance. Although their method for 
detecting MW could be implemented across a variety of tasks, the 
question of whether heart rate is indicative of MW across task 
contexts has not yet been investigated. 


Hutt et al. provided limited evidence of generalizability of MW 
detection across different learning tasks during student interaction 
with an ITS [19]. They employed a genetic algorithm to train a 
neural network using context-independent eye-gaze features and 
context-dependent interaction features (e.g., current progress 
within the ITS). They achieved an F1 value of .490 (chance = .190). 
This work provided some evidence of generalizability because the 
visual stimuli and interaction patterns varied throughout. For 
example, students interacted with an animated pedagogical agent in 
a scaffolded dialogue phase and completed concept maps without 
the tutoring agent in another interaction phase. However, it is still 
unclear if their model would generalize to a broader range of tasks, 
particularly less interactive ones like reading or film viewing. 
Furthermore, their best-performing models used context-dependent 
features, which could prevent the detector from generalizing to a 
task where those features could not be used. 


1.2 Novelty 

Our contribution is novel in a variety of ways. First, we demonstrate 
the feasibility of building cross-context detectors of mental states, 
specifically MW. Further, previous work on MW detection has 
sometimes made use of context-specific features (e.g., reading 
times) that are not expected to generalize to other contexts [19, 25]. 
In contrast, our work detects MW using only facial features and 
upper body movement, recorded using commercial-off-the-shelf 
(COTS) webcams that are expected to generalize more broadly. 
Additionally, the use of COTS webcams supports a broader 
implementation of MW detectors as webcams are ubiquitous in 
modern technology. This is in contrast to prior research that has 
used specialized equipment, like eye trackers [15, 19, 25] or 
physiology sensors [7], which students would likely not have 
access to. 


2. DATASETS 


This study makes use of narrative film [23] and scientific reading 
comprehension [22] datasets collected as part of a larger project. 
Here, we include details pertaining to video-based detection of 
MW. 


2.1 Narrative Film Comprehension 

Participants were 68 undergraduate students from a medium-sized 
private Midwestern university and 41 undergraduate students from 
a large public university in the Southern United States. Of the 109 
students, 66% were female and their average age was 20.1 years. 
Students were compensated with course credit. Data from four 
students were discarded due to equipment failure. 


Students viewed the narrative film The Red Balloon (1956), a 32.5- 
minute French-language film with English subtitles (Figure 1). The 
film has a musical score but only sparse dialogue. This short fantasy 
film depicts the story of a young Parisian boy who finds a red 
helium balloon and quickly discovers it has a mind of its own as it 
follows him wherever he goes. This film was selected because of 
the low likelihood that participants have previously seen it and 
because it has been used in other film comprehension studies [34]. 
Figure 1. A screenshot of the narrative film (left) and scientific text (right) are shown. 


Students’ faces and upper bodies were recorded with a low-cost 
($30) consumer-grade webcam (Logitech C270). 


Students were instructed to report MW throughout the film by 
pressing labeled keys on the keyboard. Specifically, students were 
asked to report a task-unrelated thought if they were “thinking 
about anything else besides the movie” and a task-related 
interference if they were “thinking about the task itself but not the 
actual content of the movie.” A small beep sounded to register their 
report, but film play was not paused. After viewing the film, 
students took a short test about the content and completed 
additional measures not discussed further. 


We recorded a total of 1,368 MW reports from the 105 participants 
with valid video recordings. In this work, we do not distinguish 
between the two types of MW, instead merging the task-unrelated 
thoughts and the task-related interferences, both of which represent 
thoughts independent of the content of the film. 


2.2 Scientific Reading Comprehension 
Participants were 104 undergraduate students from a medium-sized 
private Midwestern university and 48 undergraduate students from 
a large public university in the Southern United States. Of the 152 
participants, 61% were female and their average age was 20.1 
years. Participants were compensated with course credit. Data from 
eight participants were discarded due to equipment failure. 


Students read an excerpt from Soap-Bubbles and the Forces which 
Mould Them [8]. As with The Red Balloon (Figure 1), we chose this 
text because its content would likely be unfamiliar to a majority of 
readers. The text contained around 6,500 words from the first 
chapter of the book. In all, 57 pages (screens of text) with an 
average of 115 words each were displayed on a computer screen in 
36-pt Courier New typeface. The only modification to the text was 
the removal of images and references to them after verifying that 
these were not needed for comprehension. 


Students who read the scientific text were instructed to report MW 
in the same way as those who watched the narrative film. They were 
instructed to report a task-unrelated thought if they were “thinking 
about anything else besides the task” and a task-related interference 
if they were “thinking about the task itself but not the actual content 
of the text.” Participants completed a comprehension assessment 
after reading the text. We recorded a total of 3,168 MW reports 
from the 144 students with valid video recordings. 


2.3 Self Reports of MW 


MW was measured via self-reports in both studies, so it is prudent 
to discuss the validity of self-reports. We used self-reports because 
this is currently the most common approach to measure an 
inherently internal (but conscious) phenomenon [5, 15]. Self- 
reported MW has been linked to predictable patterns in physiology 
[30], pupillometry [17], eye-gaze [28] and task performance [27], 
providing evidence for the convergent and predictive validity for 
this approach. To improve the quality of self-reports, we 
encouraged students to report honestly and assured them that 
reporting MW would not in any way affect the credit they received 
for participation. 


The alternative to using self-caught reports is using probe-caught 
reports, which require a student response to a thought-probe (e.g., 
a beep). We chose self-caught reports over the probe-caught 
because the probe-caught method can potentially interrupt the 
comprehension process (i.e., when participants report “no” to the 
probes). Interruptions are particularly problematic in the film 
comprehension task, as participants did not have control over the 
media presentation (i.e., no pausing or rewinding of the film). 
Furthermore, it is also unclear if a probe-caught report takes place 
at the beginning or end of MW, or somewhere in between. 
Conversely, self-caught reports are likely to occur at the end of a 
MW episode when the student became aware that they were not 
attending to the task at hand. 


3. MACHINE LEARNING 


We explored a variety of machine learning techniques for cross- 
context MW detection using the same approach to segmenting 
instances and constructing features for both datasets. 


3.1 Segmenting Instances 

Reports of MW were distributed throughout the course of the film 
viewing or text reading session. We created instances that 
corresponded to reports of MW by first adding a 4-second offset 
prior to the report. This was done to ensure that we captured 
participants’ faces while MW vs. in the act of reporting MW itself 
(i.e., the preparation and execution of the key press). This 4-second 
offset was chosen based on four raters' judgments of whether or 
not movement related to the key-press could be seen within offsets 
ranging from 0 to 6 seconds. Data was then extracted from the 20 
seconds prior to the MW report. A window size of 20 seconds was 
chosen based on prior experimentation that sought to balance 
creating as many instances as possible (shorter window sizes) and 
having sufficient data in each window (longer window sizes) to 
detect MW. 
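
The windowing described above can be sketched as follows. This is a minimal illustration rather than the code used in the study. It assumes that the 20-second window ends where the 4-second offset begins, and it clamps windows at the start of the session, which is a choice the text does not specify. The function name is hypothetical.

```python
OFFSET_S = 4.0   # gap before the key press, as stated in the text
WINDOW_S = 20.0  # length of the data window, as stated in the text


def mw_window(report_time_s):
    """Return (start, end) in seconds of the data window for one MW report.

    Assumption: the window ends OFFSET_S before the report and is
    clamped at 0 for reports made very early in the session.
    """
    end = report_time_s - OFFSET_S
    start = end - WINDOW_S
    return max(start, 0.0), max(end, 0.0)


# A report at 100 s yields data from 76 s to 96 s.
print(mw_window(100.0))
```

Under these assumptions, each MW instance spans the same 24 seconds (window plus offset) as the Not MW segments, which keeps the two instance types comparable.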


We extracted “not MW” instances from windows of data between 
MW reports. The entire session (reading or video watching) was 
divided into 24-second segments (20 second windows of data and 
a 4 second offset as with the MW segments). Any segments 
overlapping the 30 seconds prior to a MW report were discarded. 
Because we do not know precisely when MW starts, this exclusion 
helps separate moments when students were actually MW from 
moments when they were not. We also discarded any segments 
overlapping a page turn 
(discussed in Section 3.2). All remaining segments were labeled 
Not MW. Our approach to segmenting instances is shown in Figure 
2. 


[Figure 2 diagram; labels: Not MW, MW, Page Turn, 4-sec, MW report] 


Figure 2. Illustration of the instance extraction method. 
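
A minimal sketch of the Not MW extraction follows. It assumes the session is tiled with non-overlapping 24-second segments starting at time zero, and that a page turn is a single time point; neither detail is stated in the text. All function and parameter names are hypothetical.

```python
SEGMENT_S = 24.0  # 20 s of data plus the 4 s offset
EXCLUDE_S = 30.0  # span before each MW report that is discarded


def overlaps(a_start, a_end, b_start, b_end):
    """True if the half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def not_mw_segments(session_len_s, report_times, page_turn_times=()):
    """Return the (start, end) segments that are kept as Not MW instances."""
    segments = []
    start = 0.0
    while start + SEGMENT_S <= session_len_s:
        end = start + SEGMENT_S
        near_report = any(
            overlaps(start, end, t - EXCLUDE_S, t) for t in report_times
        )
        on_page_turn = any(start <= p < end for p in page_turn_times)
        if not (near_report or on_page_turn):
            segments.append((start, end))
        start = end
    return segments


# Toy session: 120 s long, one MW report at 100 s, one page turn at 30 s.
print(not_mw_segments(120.0, [100.0], [30.0]))
```

In the toy session, the segment containing the page turn and every segment touching the 30 seconds before the report are dropped, so only the first segment survives.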


3.2 Instance Selection 

A full accounting of the instance selection process is shown in 
Table 1. Our goal was to make the two data sets as similar as 
possible so that task-specific effects could be studied without 
additional confounds. 


We first discarded any instances where there was less than one 
second of usable data in that time window. Data was not usable 
when the student’s face was occluded due to extreme head pose or 
position, hand-to-face gestures, and rapid movements. 
Additionally, for the scientific reading dataset, we discarded 
instances that overlapped with page turn events. In prior 
experimentation, we trained a model to detect MW using only a 
binary feature of whether or not that instance overlapped a page 
turn boundary. MW was detected at rates above chance in this 
experimental model. Therefore, we concluded that including 
instances that overlapped page turn boundaries would inflate 
performance as the detector could simply be picking up on the act 
of pressing the key to advance to the next page. 


After discarding instances using the method above, we matched the 
scientific reading and narrative film datasets on school (medium- 
sized Midwestern private university or large Southern public 
university), reported ethnicity, and reported gender. The scientific 
reading dataset was randomly downsampled to contain 
approximately the same number of students in each gender, race, or 
school category, as the film dataset. This participant-level matching 
on school, ethnicity, and gender was done to eliminate external 
sources of variance that could influence MW detection, potentially 
obfuscating task effects from population effects. 


Finally, the datasets were downsampled to contain equal numbers 
of instances because the size of the training set is known to bias 
classifier performance [13]. We also downsampled the data to 
achieve a 25% MW rate in order to be consistent with research that 
suggests that MW occurs between 20% and 30% of the time during 
reading and film comprehension [6, 23]. Further, the MW rates of 
30% and 14% obtained in these data are more artefacts of the 
instance segmentation approach rather than the objective rate, so 
resampling ensures a dataset that is more reflective of expected 
MW rates. 
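
The resampling step can be illustrated with the sketch below. It is not the procedure used in the study: it assumes simple random sampling without replacement within each class, and the function name, seed, and toy data are hypothetical.

```python
import random


def downsample_to_rate(labels, target_rate=0.25, total=None, seed=0):
    """Return sorted indices of a subsample with the requested MW proportion.

    labels: list of 1 (MW) and 0 (Not MW).
    total:  desired number of instances; if None, the largest size that
            both classes can support at target_rate is used.
    """
    rng = random.Random(seed)
    mw = [i for i, y in enumerate(labels) if y == 1]
    not_mw = [i for i, y in enumerate(labels) if y == 0]
    if total is None:
        total = int(min(len(mw) / target_rate,
                        len(not_mw) / (1 - target_rate)))
    n_mw = round(total * target_rate)
    n_not = total - n_mw
    return sorted(rng.sample(mw, n_mw) + rng.sample(not_mw, n_not))


# Toy data with a 30% MW rate, resampled to 400 instances at 25% MW.
labels = [1] * 300 + [0] * 700
kept = downsample_to_rate(labels, target_rate=0.25, total=400)
print(len(kept), sum(labels[i] for i in kept))
```

Passing the same `total` for both datasets gives equal-sized sets with the same MW rate, which is the property the text requires.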


Table 1. An accounting of the instance selection process 


Step                    Reading (% MW)    Film (% MW) 
Base                    7,267 (30%)       7,313 (14%) 
Face Detected           7,266 (30%)       7,238 (14%) 
Page Boundary           1,400 (36%)       N/A 
Participant Matching    1,273 (35%)       N/A 
Downsampling            1,100 (25%)       1,100 (25%) 


3.3 Feature Extraction and Selection 

We used commercial software, the Emotient SDK [36], to extract 
facial features. The Emotient SDK, a version of the CERT 
computer vision software [24] (Figure 3), provides likelihood 
estimates of the presence of 20 facial action units (AUs; specifically 
1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15, 17, 18, 20, 23, 24, 25, 26, 28, and 
43 [14]) as well as head pose (orientation), face position (horizontal 
and vertical within the frame), and face size (a proxy for distance 
to camera). Additionally, we used a validated motion estimation 
algorithm to compute gross body movements [33]. Body movement 
was calculated by measuring the proportion of pixels in each video 
frame that differed by a threshold from a continuously updated 
estimate of the background image generated from the four previous 
frames. 
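
The body movement measure can be sketched as below. This is only an illustration of the idea, not the validated algorithm from [33]: it assumes grayscale frames stored as nested lists, a background estimate equal to the per-pixel mean of the four previous frames, and a hypothetical threshold value.

```python
def motion_proportion(frame, previous_frames, threshold=20):
    """Fraction of pixels that differ from the background estimate.

    frame:           2D list of grayscale pixel values.
    previous_frames: list of earlier frames with the same shape
                     (four frames in the setup described in the text).
    threshold:       hypothetical intensity difference that counts as motion.
    """
    n = len(previous_frames)
    changed = 0
    total = 0
    for r, row in enumerate(frame):
        for c, value in enumerate(row):
            # Assumed background model: mean of the previous frames.
            background = sum(p[r][c] for p in previous_frames) / n
            if abs(value - background) > threshold:
                changed += 1
            total += 1
    return changed / total


# Toy 2x2 example: one of four pixels changes against a static background.
still = [[10, 10], [10, 10]]
moved = [[10, 200], [10, 10]]
print(motion_proportion(moved, [still] * 4))
```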


Figure 3. Interface demonstrating AU estimates detected from 
a face video. 


Features were created by aggregating Emotient estimates in a 
window of time leading up to each MW or Not MW instance using 
minimum, maximum, median, mean, range, and standard deviation 
for aggregation. In all, there were 162 facial features (6 aggregation 
functions x [20 AUs + 3 head pose orientation axes + 2 face 
position coordinates + face size + Motion]). Outliers (values greater 
than three standard deviations from the mean) were replaced by the 
closest non-outlier value in a process called Winsorization [11]. 
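
The feature construction can be illustrated with the sketch below. It is not the pipeline used in the study. It assumes a population standard deviation and that Winsorization is applied to one feature's values across instances; the channel names for pose, position, size, and motion are hypothetical.

```python
import statistics

# The six aggregation functions named in the text.
AGGREGATES = {
    "min": min,
    "max": max,
    "median": statistics.median,
    "mean": statistics.fmean,
    "range": lambda v: max(v) - min(v),
    "sd": statistics.pstdev,  # population SD is an assumption
}


def aggregate_window(channels):
    """Map {channel: frame-level values in one window} to named features."""
    return {
        f"{name}_{agg}": fn(values)
        for name, values in channels.items()
        for agg, fn in AGGREGATES.items()
    }


def winsorize(values, n_sd=3.0):
    """Replace values beyond n_sd SDs of the mean with the closest inlier."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    low, high = mean - n_sd * sd, mean + n_sd * sd
    inliers = [v for v in values if low <= v <= high]
    lo_val, hi_val = min(inliers), max(inliers)
    return [min(max(v, lo_val), hi_val) for v in values]


# 20 AUs + 3 pose axes + 2 position coordinates + face size + motion = 27.
channel_names = [f"AU{n}" for n in (1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15,
                                    17, 18, 20, 23, 24, 25, 26, 28, 43)]
channel_names += ["yaw", "pitch", "roll", "face_x", "face_y",
                  "face_size", "motion"]
window = {name: [0.1, 0.5, 0.9] for name in channel_names}
features = aggregate_window(window)
print(len(features))  # 6 functions x 27 channels = 162
```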


We used tolerance analysis to eliminate features with high 
multicollinearity (variance inflation factor > 5) [1], after which, 37 
features remained. This was followed by RELIEF-F [21] feature 
selection (on the training data only) to rank features. We retained a 
proportion of the highest ranked features for use in the models 
(proportions ranging from .05 to 1.0 were tested). Feature selection 
was performed using nested cross-validation on training data only. 
We ran 5 iterations of feature selection within each cross-validation 
fold (discussed below), using data from a randomly chosen 67% of 
students within the training set in each iteration. 


3.4 Supervised Classification and Validation 

Informed by preliminary experiments, we selected seven classifiers 
for more extensive tests (Naive Bayes, Simple Logistic Regression, 
LogitBoost, Random Forest, C4.5, Stochastic Gradient Descent, 
and Classification via Regression) using the WEKA data mining 
toolkit [18]. For each classifier, we applied SMOTE [9] to the 
training set only. SMOTE, a common machine learning technique 
for dealing with data imbalance, creates synthetic interpolated 
instances of the minority class to increase classification 
performance. 
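
The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration, not the WEKA implementation used in the study; the neighbor count, seed, and function name are assumptions.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority instances by interpolation.

    Each synthetic point lies on the line segment between a randomly
    chosen minority instance and one of its k nearest minority neighbors.
    Requires at least two minority instances.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Sort the remaining minority points by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


# Toy minority class on the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=10, k=2)
print(len(new_points))
```

Because the synthetic points are generated from training instances only, applying this step inside each training fold keeps the test participant's data untouched.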


We evaluated the performance of our classifiers using leave-one- 
participant-out cross-validation. This process runs multiple 
iterations of each classifier in which, for each fold, the instances 
pertaining to a single participant are added to the test set and the 
training set is comprised of the instances for the other participants. 
Feature selection was performed on a subset of participants in the 
training set. The leave-one-out process was repeated for each 
participant, and the classifications of all folds were weighted 
equally to produce the overall result. This cross-validation 
approach ensured that in each fold, data from the same participant 
was in the training set or testing set but never both, thereby 
improving generalization to new participants. 
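
A minimal sketch of the fold construction is given below. It only shows how instances are partitioned by participant; the identifiers and function name are hypothetical.

```python
def leave_one_participant_out(participant_ids):
    """Yield (held_out, train_indices, test_indices) for each participant.

    participant_ids: one identifier per instance, in instance order.
    """
    for held_out in sorted(set(participant_ids)):
        train = [i for i, p in enumerate(participant_ids) if p != held_out]
        test = [i for i, p in enumerate(participant_ids) if p == held_out]
        yield held_out, train, test


# Six instances from three hypothetical participants.
ids = ["s1", "s1", "s2", "s3", "s3", "s3"]
folds = list(leave_one_participant_out(ids))
for held_out, train, test in folds:
    print(held_out, train, test)
```

No index appears in both lists of a fold, so a participant's instances never inform the model that is tested on that participant.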


Accuracy (recognition rate) is a common measure to evaluate 
performance in machine learning tasks. However, any classifier 
that defaults to predicting the majority class label of an imbalanced 
dataset can appear to have high accuracy despite incorrect 
predictions of all instances of the minority class label [20]. This is 
particularly detrimental in applications where detecting the 
minority class is of utmost importance. In our task, we prioritized 
the detection of MW despite the large imbalance in our dataset. 
Therefore, we considered the F1 score for the MW label as our key 
measure of detection accuracy since F1 attempts to strike a balance 
between precision and recall. 
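
The following sketch shows why accuracy can mislead on imbalanced data while F1 does not. The toy labels use the 25% MW rate from Section 3.2; the function name is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the positive (MW) label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


y_true = [1] * 25 + [0] * 75

# A classifier that always predicts the majority class (Not MW).
always_not_mw = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, always_not_mw)) / len(y_true)
print(accuracy, precision_recall_f1(y_true, always_not_mw))

# A classifier that always predicts MW has perfect recall but low precision.
always_mw = [1] * 100
print(precision_recall_f1(y_true, always_mw))
```

The majority-class classifier reaches 75% accuracy with an MW F1 of zero, which is the failure mode that motivates reporting F1 for the MW label.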


4. RESULTS 


4.1 Cross-dataset Training and Testing 

We trained three classifiers: one on the scientific text dataset, one 
on the narrative film dataset, and one on a concatenated dataset 
comprised of the first two. For each of the three training sets, the 
classifier that yielded the highest MW F1 is shown in Table 2. We 
used leave-one-student-out cross validation for within-dataset 
evaluations. Conversely, to measure generalizability of the models 
across contexts we applied the classifier trained on scientific text 
data to the narrative film data, and vice versa. We compared our 
model to a chance model that classified a random 25% (MW prior 
proportion) of the instances as MW. This chance-level method 
yielded a precision and recall of .250 (equal to the MW base rate). 


Table 2. Results for the models with highest MW F1 for the 
within-dataset validation (cross-training results in parentheses). 


Training Set      Classifier   MW F1          Precision      Recall 
Scientific Text   Logitboost   .441 (.267)    .376 (.252)    .553 (.284) 
Narrative Film    C4.5         .436 (.407)    .303 (.278)    .775 (.760) 
Both              Logistic     .424           .314           .655 


We calculated improvement over chance as (actual performance - 
chance) / (perfect performance - chance). All three models showed 
improvement over chance (25% for scientific text, 25% for 
narrative film, and 23% for the concatenated dataset) when trained 
and tested on the same dataset. When tested on the alternative 
dataset, the narrative film classifier generalized well to the 
scientific text dataset (21% improvement over chance). However, 
the scientific text model showed chance-level performance on the 
narrative film corpus (2% improvement over chance). The MW F1 
of the concatenated dataset model was simply an average of the 
MW F1 scores of the individual datasets when the instance 
predictions of the individual datasets are separated (.413 for the 
scientific reading dataset and .436 on the narrative film dataset). 
These results showed that the concatenated classifier does not skew 
towards predicting one dataset better than the other, but rather 
predicts both datasets with comparable accuracy. 
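
As a worked check, the improvement-over-chance figures can be reproduced from the MW F1 values in Table 2. The sketch assumes a chance MW F1 of .250, which follows from the chance precision and recall of .250 reported in Section 4.1; the function and dictionary names are hypothetical.

```python
def improvement_over_chance(actual, chance=0.25, perfect=1.0):
    """(actual - chance) / (perfect - chance), as defined in the text."""
    return (actual - chance) / (perfect - chance)


# MW F1 values copied from Table 2.
table2_f1 = {
    "scientific text (within)": 0.441,
    "narrative film (within)": 0.436,
    "both (within)": 0.424,
    "narrative film -> scientific text": 0.407,
    "scientific text -> narrative film": 0.267,
}
percent = {name: round(100 * improvement_over_chance(f1))
           for name, f1 in table2_f1.items()}
print(percent)
```

Rounded to whole percentages, these values are 25, 25, 23, 21, and 2, matching the figures reported in the text.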


Table 2 also shows precision and recall for each of the models. 
Across all models, recall was higher than precision, indicating a large 
number of false positives. It is important to note the near chance-level recall 
and precision of the model trained on scientific reading data when 
applied to the narrative film data. The lack of improvement over 
chance for both recall and precision demonstrated the need to 
improve generalizability in both dimensions. Conversely, the cross- 
trained narrative film model had lower precision, but good recall, 
resulting in an improved MW F1 score. 


4.2 Classifier Generalizability 

To address the negligible improvement over chance of the scientific 
text model when tested on the narrative film dataset, we repeated 
the training and testing using C4.5 as the classifier. The C4.5 
classifier was chosen because it generalized better when trained on 
the narrative film dataset than the Logitboost classifier generalized 
when trained on the scientific text dataset. The results are shown in 
Table 3, which shows no notable improvement over the previous 
Logitboost classifier in Table 2 (change from .267 to .287 when 
tested on the narrative film dataset). Therefore, the lack of evidence 
for generalizability for the scientific text model could be due to 
overfitting to the training set, rather than classifier selection. 


Table 3. Results (MW F1) for the C4.5 classifier for within- 
and cross-dataset validation. 


Training Set      Within   Cross 
Scientific Text   0.425    0.287 
Narrative Film    0.436    0.407 
Both              0.415    N/A 


4.3 Prediction Threshold Adjustment 

We further investigated the lack of generalizability of the scientific 
text model by considering the MW prediction rate. We compared 
the performance of both models on the narrative film dataset. Recall 
dropped considerably more than precision (Table 2; recall dropped 
from .775 to .284; precision decreased from .303 to .252). We 
hypothesized that recall decreased because of a difference in 
predicted MW rates (Table 4). In fact, the predicted MW rate in the 
narrative film data dropped from 64% to 28% when applying the 
scientific text model to the same data. This supported our 
hypothesis that the low recall was linked to lower predicted MW 
rates. Furthermore, 39% of the correctly classified instances (true 
positives and true negatives) were MW when applying the narrative 
film model to the narrative film data compared to 12% for the 
scientific text model applied to the same data. This demonstrated 
that the scientific text model was much more prone to missing MW 
instances, further supporting our hypothesis. 


To address this, we adjusted the predicted MW rate of the scientific 
text model when applied to the narrative film dataset. The classifier 
outputs a likelihood of MW and we previously considered instances 
with likelihoods greater than .5 as MW. We adjusted that prediction 
threshold from .1 to 1 in increments of .1 (Figure 4) to investigate 
how changes in predicted MW rate (higher for lower thresholds) 
affected recall, and thus MW F1. 

Table 4. Predicted MW Rates. 


Training Set      Within   Cross 
Scientific Text   38%      28% 
Narrative Film    64%      68% 
Both              52%      N/A 

Figure 4. MW precision, recall, and F1 as the prediction 
threshold varies for the scientific text model applied to the 
narrative film dataset. 
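
The threshold sweep can be sketched as follows. The likelihoods and labels are toy values chosen for illustration, not data from the study, and the function name is hypothetical. An instance is labeled MW when its likelihood is greater than the threshold, as described in the text.

```python
def sweep_thresholds(y_true, likelihoods, thresholds):
    """Compute predicted MW rate, precision, recall, and F1 per threshold."""
    rows = []
    for th in thresholds:
        y_pred = [1 if p > th else 0 for p in likelihoods]
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append({
            "threshold": th,
            "predicted_rate": sum(y_pred) / len(y_pred),
            "precision": precision,
            "recall": recall,
            "f1": f1,
        })
    return rows


thresholds = [i / 10 for i in range(1, 11)]  # .1 to 1 in steps of .1
y_true = [1, 1, 0, 0]                        # toy labels
likelihoods = [0.45, 0.60, 0.35, 0.20]       # toy classifier outputs
rows = sweep_thresholds(y_true, likelihoods, thresholds)
for row in rows:
    print(row)
```

In this toy example, lowering the threshold from .5 to .3 raises the predicted MW rate and recall at the cost of precision, which mirrors the trade-off discussed in this section.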


We note that MW F1 score degrades at a threshold of .5. We 
adjusted the threshold to .3 and obtained the results shown in Table 
5. After adjusting the MW prediction threshold, both precision and 
recall of the scientific text model applied to the narrative film data 
showed comparable performance to the cross-trained narrative film 
model. It is important to note that the adjusted MW prediction 
threshold yielded a predicted MW rate of 76%, much higher than 
the MW rate of the dataset (25%). As with the generalized narrative 
film model, this reduced precision because the high predicted MW 
rate produced a large number of false positives. 


Table 5. Results for models with highest MW F1 (cross- 
training results in parentheses). Cross-training results for the 
scientific text model reflect a MW prediction threshold of .3. 


Training Set      Classifier   MW F1          Precision      Recall 
Scientific Text   Logitboost   .441 (.416)    .376 (.276)    .553 (.836) 
Narrative Film    C4.5         .436 (.407)    .303 (.278)    .775 (.760) 
Both              Logistic     .424           .314           .655 


4.4 Feature Analysis 

We analyzed the facial features to further study generalizability by 
predicting MW with different subsets of the entire feature set. The 
C4.5 classifier was chosen for this feature analysis because of its 
consistency on both the scientific text model and concatenated 
dataset. Each subset consisted of the features (e.g., median, 
standard deviation) from one AU, or from face position, size, 
orientation, or motion. Since tolerance analysis was not used here, 
we only considered the minimum, maximum, median, and standard 
deviation aggregated features to prevent redundancy (e.g., between 
median and mean). For example, we used the minimum, maximum, 
median, and standard deviation feature values for AU5 (upper lid 
raiser) to predict MW. This approach was applied to the 20 AU 
subsets, as well as face position, size, orientation, and motion 
subsets. We generated the same cross-training configurations as in 
Section 4.1 (i.e., train on scientific text, test on narrative film, etc.). 


To rank the subsets of features on generalizability, we examined 
MW F1 scores when testing on the alternative dataset only. For 
example, using the AU9 (nose wrinkler) subset, we investigated 
the MW F1 value of the scientific text model applied to the narrative film 
dataset and the narrative film model applied to the scientific text 
dataset. Table 6 shows these results only for features that achieved 
a MW F1 of greater than .250 (chance) on all dimensions (within 
dataset validation and cross-training). We selected features for 
further analysis if their MW F1 was greater than .300 for both cross- 
training results. This value of .300 was used to filter out features 
that performed well on the within-dataset validation, but fell short 
on cross training. It also ensured that a feature performed better 
than chance on both cross-trained results (i.e., train on narrative 
film and test on scientific text, and vice versa), rather than only 
generalizing to one dataset. Using this criterion, only AU23 and 
AU26 showed notable improvement over chance. 
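
The selection criterion can be checked directly against the cross-training scores in Table 6. The sketch below copies those parenthesized values; the dictionary and function names are hypothetical.

```python
# Cross-training MW F1 per feature subset, copied from Table 6 as
# (trained on scientific text, trained on narrative film).
cross_f1 = {
    "AU4": (0.278, 0.395),
    "AU6": (0.259, 0.321),
    "AU9": (0.268, 0.303),
    "AU14": (0.267, 0.376),
    "AU23": (0.333, 0.317),
    "AU26": (0.321, 0.357),
    "Face Height": (0.256, 0.289),
    "Face X": (0.316, 0.282),
}


def generalizing_subsets(scores, cutoff=0.300):
    """Keep subsets whose MW F1 exceeds the cutoff in both directions."""
    return sorted(name for name, pair in scores.items()
                  if all(value > cutoff for value in pair))


print(generalizing_subsets(cross_f1))  # ['AU23', 'AU26']
```

Several subsets clear the cutoff in one direction only (for example AU4 and Face X), which is why requiring both directions narrows the list to two.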


We used the C4.5 classifier to generate the same models in Table 2 
(train/test scientific text, train scientific text/test narrative film, etc.) 
using only the features from AU23 and AU26 (Table 7). None of 
these models (scientific text, narrative film, or concatenated) 
achieved a MW F1 as high as those in Table 2, which used a 
combination of tolerance analysis and RELIEF-F to select features. 
This suggested that, while AU23 and AU26 might individually 
predict MW, when used together, their prediction power might be 
limited, compared to other feature selection techniques. 


Table 6. MW F1 score for within-dataset validation with 
cross-dataset scores (in parentheses). 


                        Training Set 
Facial Feature          Scientific Text   Narrative Film 
AU4 (brow lowerer)      .378 (.278)       .398 (.395) 
AU6 (cheek raiser)      .369 (.259)       .361 (.321) 
AU9 (nose wrinkler)     .300 (.268)       .392 (.303) 
AU14 (dimpler)          .303 (.267)       .383 (.376) 
AU23 (lip tightener)    .334 (.333)       .363 (.317) 
AU26 (jaw drop)         .414 (.321)       .365 (.357) 
Face Height (size)      .322 (.256)       .339 (.289) 
Face X (position)       .404 (.316)       .382 (.282) 

Table 7. Results for models when only using the C4.5 classifier 
on AU23 and AU26. 


Training Set      Classifier   MW F1          Precision      Recall 
Scientific Text   C4.5         .383 (.272)    .255 (.206)    .764 (.404) 
Narrative Film    C4.5         .397 (.257)    .333 (.235)    .491 (.284) 
Both              C4.5         .368           .271           0/9 


5. ANALYSIS 


We developed automated detectors of MW using video-based 
features in the contexts of narrative film viewing and scientific 
reading. The generalizability of these models was dependent on 
the corpus on which the model was trained and the rate at which the 
model predicts MW. In this section, we discuss our main findings 
and applications of this work. We also discuss limitations and 
future work. 


5.1 Main Findings 


We expanded on previous MW detection work through cross- 
context modeling. We trained three models on three datasets 
(scientific text, narrative film, and a dataset concatenated from the 
two). We found each of these models (trained and tested on the 
same corpus) performed at a notable 23% to 25% improvement 
over chance. This demonstrated the feasibility of detecting MW on 
individual corpora. However, recall was greater than precision, 
indicating prediction of false positives. This should be considered 
when implementing MW detectors in educational environments 
where excessive prediction of student MW could be demotivating. 


We investigated generalizability of the single-dataset models (i.e. 
scientific text or narrative film) by applying the model to the dataset 
on which it was not trained. The model trained on the narrative film 
dataset maintained performance when applied to the scientific text 
dataset (Table 2), providing some evidence for generalizability, but 
this performance was boosted by high recall (and comparatively 
low precision). Precision and recall (and thus MW F1) were near 
chance-level when the model trained on the scientific text dataset 
was applied to the narrative film dataset, suggesting that the model 
might overfit to the scientific text training set. 


We attempted to address this problem by applying the C4.5 
classifier, as it comparatively generalized well when trained on the 
narrative film dataset. The MW F1 score for the scientific text classifier 
applied to the narrative film data again increased only negligibly. This 
suggested that the training data (only scientific text) used was not 
appropriate for model generalization. This idea is supported by the 
performance of the narrative film model on the scientific text data 
(although detection of false positives is a limitation) and the notable 
improvement over chance (22% to 23%) for the concatenated 
dataset. The performance of both models suggested that there were 
discernable similarities between MW instances across the two 
datasets, which can be detected using our techniques. 


In addition to training data, we also found that predicted MW rate 
affected model generalizability. We adjusted MW predictions 
according to a sliding threshold for the narrative film predictions 
obtained from the scientific text model. We found that relaxing the 
criteria for classifying an instance as MW (i.e. adjusting the 
likelihood prediction threshold from .5 to .3) yielded results 
comparable to the cross-trained narrative film model. However, this 
approach to increasing recall should be used with caution as it leads 
to increased likelihood of false positives. Perhaps in a real-time 
MW intervention scenario, a more balanced approach could be 
taken where the MW likelihood prediction is used to determine if a 
MW intervention is triggered (e.g., if the detector determines there 
is a 40% likelihood the student is MW, then there is a 40% chance 
a MW intervention is triggered). 
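
The probabilistic triggering idea can be sketched in a few lines. This is an illustration of the proposal above, not an implemented or evaluated intervention policy; the function name and seed are hypothetical.

```python
import random


def should_intervene(mw_likelihood, rng=random):
    """Trigger an intervention with probability equal to the MW likelihood."""
    return rng.random() < mw_likelihood


# With a 40% MW likelihood, roughly 40% of decisions trigger an intervention.
rng = random.Random(0)
trials = 10_000
rate = sum(should_intervene(0.40, rng) for _ in range(trials)) / trials
print(rate)
```

Compared with a fixed threshold, this policy never intervenes at a likelihood of 0, always intervenes at a likelihood of 1, and scales smoothly in between.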


We detected MW using individual feature subsets to ascertain 
whether certain face-based features (i.e. AUs, head orientation, 
position, size, and motion) generalize. We found two feature 
subsets (AU23, lip tightener, and AU26, jaw drop) that showed a 
MW F1 of at least .300 on both cross-trained models. It is notable 
that when looking at the generalizability of these features, they did 
not individually achieve MW F1 scores as high as the best 
performing models in Table 2. This demonstrated the need for 
multiple features to work together to detect MW, rather than relying 
on a single feature. Furthermore, this showed that our method of 
feature selection (tolerance analysis and selecting a proportion of 
features using RELIEFF) was important to model performance. 


5.2 Applications 

The present findings are applicable to educational user interfaces 
that involve reading or film comprehension. Monitoring and 
responding to MW could greatly improve student performance on 
these tasks. Films and instructional texts play a major role in 
learning (both in the classroom and online). For example, films can 
give historical background on a time period being discussed in 
literature classes and instructional texts can supplement lecture 
content through textbooks or technical articles. Due to the 
relationship between MW and low task performance, user 
interfaces that detect and respond to MW in contexts where 
attention is key (e.g., education) would help students remain focused 
on their learning. 


These findings are particularly promising for implementation in 
massive open online courses (MOOCs). Our method for detecting 
MW exclusively uses COTS webcams. These webcams are 
ubiquitous in today’s computers and mobile devices; thus our work 
would integrate into a variety of learning environments without 
extra cost. Such a video-based detector of MW could feasibly 
respond to student MW by suggesting that a student revisit text or 
video content, asking a reengaging question, or advising the student 
to take a break. 


5.3 Limitations and Future Work 


While we demonstrated techniques for modeling generalizability 
across task contexts, our work has a few limitations. First, precision 
is moderate, even on our best models. High predicted MW rates 
lead to high recall, but also more false positives. In this work, we 
chose to accept this tradeoff, with the goal of generalizability in 
mind. However, raising precision while maintaining recall is key 
to task-generalizable MW detectors being successful in educational 
environments. Since MW is the minority class (25% of all 
instances), investigating skew-insensitive classifiers, such as 
Hellinger Distance Decision Trees [10], could improve precision. 
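
The precision/recall tradeoff described above can be made concrete with a small worked example. The counts below are hypothetical (they are not results from this work) and are chosen only to mirror a 25% MW base rate with a high predicted MW rate; `precision_recall_f1` is an illustrative helper name.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for the positive (MW) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical: 1000 instances, 250 of them MW (25%).
# A detector that flags 500 instances and catches 200 of the 250 MW instances:
tp, fp, fn = 200, 300, 50
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(precision, recall, round(f1, 3))  # precision 0.4, recall 0.8, F1 about 0.533
```

In this toy case recall is high because half of all instances are flagged, but three of every five flags are false positives, which is the situation a skew-insensitive classifier would aim to improve.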


Additionally, this work focuses exclusively on generalizability 
from the perspective of task context (viewing a narrative film vs. 
reading a scientific text). Claims of generalizability could be 
strengthened through MW detection across environments. Both the 
narrative film and scientific reading datasets were collected in a 
controlled lab setting. MW detection in the field, such as computer- 
enabled classrooms or the personal workstations of MOOC users, 
should be considered prior to implementation in such 
environments. Furthermore, student generalizability should be 
further examined. In this work, we detect MW in a student- 
independent way. However, participants were all of similar age and 
enrolled in college. Future work could examine the generalizability 
of our method for detecting MW in non-college-aged students, such 
as elementary students in a computer-enabled classroom or non- 
traditional students enrolled in distance learning courses. 


5.4 Concluding Remarks 

In this work, we showed evidence that generalizable detectors of 
MW can be created using video-based features. The corpora used 
to train models of MW and predicted MW rates both play a role in 
the model’s ability to generalize and should be considered as work 
on cross-context MW generalization advances. This work advances 
the field of attention-aware interfaces [12] by demonstrating the 
feasibility of modeling MW across the educational contexts of 
reading a scientific text and viewing a narrative film. Our approach 
to detecting MW is the first step towards building interfaces that 
detect MW across multiple educational activities. 


ABSTRACT 


We present results of a randomized controlled study that 
compared different types of affective messages delivered by 
pedagogical agents. We used animated characters that were 
empathic and emphasized the malleability of intelligence and 
the importance of effort. Results showed significant corre- 
lations between students who received more empathic mes- 
sages and those who were more confident, more patient, ex- 
hibited higher levels of interest, and valued math knowledge 
more. Students who received more growth mindset messages 
tended to get more problems correct on their first 
attempt but valued math knowledge less and had lower 
posttest scores. Students who received more success/failure 
messages tended to make more mistakes, to be less learning- 
oriented, and stated that they were more confused. We con- 
clude that these affective messages are powerful media to 
influence students’ perceptions of themselves as learners, as 
well as their perceptions of the domain being taught. We 
have reported significant results that support the use of em- 
pathy to improve student affect and attitudes in a math 
tutor. 


Keywords 
student affect, empathy messages, growth mindset, peda- 
gogical agents, intelligent tutor, confidence 


1. INTRODUCTION 


Students experience many emotions while studying and tak- 
ing tests [16]. Students’ emotions (such as confidence, bore- 
dom, and anxiety) can influence achievement outcomes [10, 
18] and predispositions (such as low self-concept and pes- 
simism) can diminish academic success [5, 14]. 


Proceedings of the 10th International Conference on Educational Data Mining 


Rafael Lizarralde 
University of Massachusetts 
Amherst 
140 Governors Drive 
Amherst, MA 01003-9264 
rezecib@cs.umass.edu 


Ivon Arroyo 
Worcester Polytechnic Institute 
100 Institute Rd 
Worcester, MA 01609 
larroyo@wpi.edu 


Danielle Allessio 
University of Massachusetts 
Amherst 
140 Governors Drive 
Amherst, MA 01003-9264 
allessio@umass.edu 


Naomi Wixon 
Worcester Polytechnic Institute 
100 Institute Rd 
Worcester, MA 01609 
mwixon@wpi.edu 


Pekrun’s Control-Value Theory of emotion has been experi- 
mentally validated by classroom experiments that used stu- 
dent self-reports (answers to 5-point scale survey questions). 
These experiments provide evidence that educational inter- 
ventions can reduce students’ anxiety [16, 19]. 


Dweck’s Growth Mindset Theory suggests that students who 
believe that intelligence can be increased through effort and 
persistence tend to seek out academic challenges, compared 
to those who view their intelligence as immutable [8, 9]. 
Students who are praised for their effort (as opposed to per- 
formance) are more likely to view intelligence as being mal- 
leable, and their self-esteem remains stable regardless of how 
hard they have to work to succeed at a task. 


Hattie and Timperley [13] studied which types of feedback 
and conditions enable learning to flourish and which cases 
stifle growth. According to their study, feedback is intended 
to help a student get from where they are to where they need 
to be. Graesser et al. [12] reported that there are significant 
relationships between the content of feedback dialogue and 
the emotions experienced during learning. They found sig- 
nificant correlations between dialog and the affective states 
of confusion, eureka (delight) and frustration. 


Pekrun et al. [17] tested a theoretical model positing that 
a student’s anticipated achievement feedback in a classroom 
setting influences his/her achievement goals and emotions. 
For example, self-referential feedback, in which a student’s 
competence is defined in terms of self-improvement, had a 
positive influence on a student’s mastery goal adoption. On 
the other hand, normative feedback, in which student compe- 
tence is defined relative to other students’ mastery goals and 
performance goals, had a positive influence on performance- 
approach and performance-avoidance goal adoption. Fur- 
thermore, feedback condition and achievement goals pre- 
dicted test-related emotions (i.e., enjoyment, hope, pride, 
relief, anger, anxiety, hopelessness, and shame). 


Teachers have limited opportunities to recognize and respond 
to individual students’ affect in typical classrooms. 
Ideally, digital learning environments can manage the delicate 
balance between motivation and cognition, promoting 
both interest and deep learning. The overwhelming majority 
of work on affect-aware virtual tutors has focused on mod- 
eling affect, i.e., designing computational models capable of 
detecting how students feel while they interact with intelli- 
gent tutoring systems [2]. While modeling affect is a critical 
first step, very little research exists on systematically explor- 
ing the impact of interventions on students’ performance, 
learning, and attitudes, i.e., how an environment might respond 
to students’ emotions (e.g., frustration, anxiety, and 
boredom) as they arise. D’Mello and Graesser carried out 
closely related work on empathic characters in AutoTutor, 
a conversational tutor that uses 3D companions to conduct 
dialogs in natural language with students [6, 7, 11]. 


1.1 MathSpring 


The testbed for this research is MathSpring, an intelligent 
tutor that personalizes mathematics problems, provides help 
using multimedia, and effectively helps students improve 
their standardized test scores [4]. Learning companions 
(Figure 1) in MathSpring suggest to students that their effort 
contributes to success, and that making mistakes only 
means more effort is needed. Companions use about 20 different 
messages focused on effort and growth mindset (Table 2). 


To date, MathSpring learning companions have produced 
significant positive effects for the overall population of students 
and were more effective for lower achieving students 
and for female students in general [2]. However, charac- 
ters seemed to have been harmful to some students (e.g., 
high-achieving males), who had higher affective baselines at 
pretest time and seem to have been distracted by the charac- 
ters. These results suggest that affective characters should 
probably be different for students who are not presently frus- 
trated or anxious (often high achieving males). One possi- 
bility is that the behavior of the characters be adaptive to 
the affective state of the student. 


1.2 Recognize and Respond to Affect 
Previously, we evaluated the hypothesis that tailored affective 
messages delivered by digital animated characters 
may positively impact students’ emotions, attitude, 
and learning performance. Specifically, we identified 
concrete prescriptive principles about how to respond 
to student emotion as it occurs during online learning [1, 3]. 
With models of student emotion, we explored mechanisms to 
address negative emotions. Our models predict confidence, 
interest, frustration, and excitement in real-time, based on 
data from hundreds of students. The gold standard was 
students’ self-reported responses to questions, such as “How 
confident do you feel right now?” 


We found that growth mindset messages based on Dweck’s 
theory [9] provide an apparent boost in student math 
learning [3], resulted in less performance-oriented goals 
(e.g., beating classmates, instead of a self-referenced focus), 
and less boredom reported on the posttest. Typically, 
online educational systems only report correctness: “Your 
answer is correct/incorrect.” We discovered that such success/failure 
messages are correlated with higher reported anxiety 
and boredom, and appear to increase performance-oriented 
goals [3]. Other results indicate that empathic 
characters can help decrease students’ anxiety and boredom. 
Our results showed that: a) student anxiety and boredom 
can be reduced using simple 2D characters, as D’Mello et 
al. (2007) also found; b) these benefits are due primarily to empathy, 
and secondarily to growth mindset messages; and c) indicat- 
ing only success or failure is actively harmful to students, 
in comparison to emphasizing the learning process and the 
importance of effort. 


Figure 1: Learning companions respond to student 
actions with gestures and messages shown both as 
text and audio. Above: Companion shows high in- 
terest while the student views an example problem 
with solution steps shown. Below: Companion pro- 
vides a growth mindset message, encouraging the 
student to put in effort to become good at math. 


1.3 Research Goals 


The research questions in this paper focus on identifying 
messages that support students’ motivation to persist in working 
on a task. Which messages (see Table 2) should a tutoring 
system send to students to encourage them to persist? 
How should agents respond to negative emotions? Should 
students be praised when they do well? Are the benefits to 
student learning and emotion due to empathic or motiva- 
tional aspects of the companion? What are the results on 
learning and emotion of using an empathic or less empathic 
companion in comparison to a companion that indicates only 
success or failure? 


Table 1: Outcome variables measured in the experiment. The questions on the pre- and posttest were 
answered on a 5-point scale, going from “not at all” to “very much”. 


Interest - Students’ interest in math. “Are you interested when solving math problems?” 

Excitement - How exciting students find math. “Do you feel that solving math is exciting?” 

Confusion - How confused students feel while solving math problems. “Do you feel confident that you will 
eventually be able to understand the Mathematics material?” 

Frustration - How frustrating students find math. Average of “Do you get frustrated when solving math problems?” 
and “Does solving math problems make you feel frustrated?” 

Learning Orientation - How much students focus on learning as opposed to performance. Average of “When 
you are doing math exercises, is your goal to learn as much as you can?” and “Do you prefer learning about things 
that make you curious even if that means you have to work harder?” 

Performance Approach Goals - “Do you want to show that you are better at math than your classmates?” 

Math Value - How important students think math is. “Compared to most other activities, how important is 
it for you to be good at math?” 

Math Liking - Measure of how much students like math. “Do you like your math class?” 

Math Test Performance - Student’s score on math questions that are representative of the content covered in 
MathSpring. 


2. METHOD 


We conducted a randomized controlled study to evaluate 
three different types of affective messages delivered by ped- 
agogical agents (Table 2). The study took place in an ur- 
ban school district in Southern California with sixty-four 6th 
grade students in three math classes for four class sessions, 
during December 2016. On part of the first and last day, 
students completed a pretest and posttest including ques- 
tions related to various affective states, and questions about 
mathematics. Outcome variables measured from these ques- 
tions are provided in Table 1. 


Three conditions of learning companion messages were ran- 
domly assigned to students and delivered in both audio and 
written form in order to increase the likelihood of exposure: 
1) Empathy Condition for 24 students, 2) Growth 
Mindset Condition for 20 students, and 3) Success/Failure 
Condition for 20 students; see Table 2 for examples of the 
different types of messages. For all conditions, students were 
asked to self-report their frustration or confidence on a five-point 
scale every five minutes or every eight problems, whichever 
came first, but only after a problem was completed. 
The prompts were shown on a separate screen and invited 
students to report on their frustration or confidence. 
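
The prompting rule (every five minutes or every eight problems, whichever comes first, and only after a problem is completed) can be sketched as a small scheduler. This is an illustrative reconstruction, not MathSpring's code: the class name `SelfReportScheduler`, its fields, and the use of seconds as the time unit are assumptions.

```python
from dataclasses import dataclass

@dataclass
class SelfReportScheduler:
    """Decide when to show a self-report prompt (hypothetical sketch)."""
    max_seconds: float = 5 * 60      # five minutes
    max_problems: int = 8            # eight problems
    last_prompt_time: float = 0.0
    problems_since_prompt: int = 0

    def on_problem_completed(self, now: float) -> bool:
        """Call only when a problem is completed; True means 'prompt now'."""
        self.problems_since_prompt += 1
        due = (now - self.last_prompt_time >= self.max_seconds
               or self.problems_since_prompt >= self.max_problems)
        if due:
            # Reset both counters so the next prompt is measured from here.
            self.last_prompt_time = now
            self.problems_since_prompt = 0
        return due

# A slow student (one problem per minute) is prompted by the time limit.
slow = SelfReportScheduler()
print([slow.on_problem_completed(t) for t in (60, 120, 180, 240, 300)])

# A fast student (one problem every 10 s) is prompted by the problem count.
fast = SelfReportScheduler()
print([fast.on_problem_completed(10 * i) for i in range(1, 9)])
```

Because the check runs only on problem completion, a prompt never interrupts a problem in progress, which matches the constraint stated above.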


The Empathy condition was set to visually reflect positive 
emotion with a certain probability for each math problem 
if the last student emotion report had a positive valence. 
When the most recent emotion report had a negative va- 
lence, and with a certain probability, the character first vi- 
sually reflected the negative emotion; then it reported an 
empathy message such as “Sometimes these problems make 
me feel [frustrated]”, and finally a connector such as “on the 
other hand”, connected with a growth mindset message such 
as “I know that putting effort into problem solving and learn- 
ing from hints will make our intelligence grow.” Note that 
only students experiencing negative emotions were exposed 
to growth mindset messages, as opposed to the following 
condition. 


The Growth Mindset condition emphasized messages that 
accentuate the importance of effort and perseverance in achieving 
success. The growth mindset condition was set to provide 
one of many growth mindset messages after a second incorrect 
attempt was made (the first incorrect attempt caused 
the hint button to flash), regardless of students’ emotions. 
This condition also provided occasional growth mindset mes- 
sages at the beginning of a new problem. 


The Success/Failure condition provided both traditional 
success/failure messages and some more basic meta-cognitive 
support for when students made mistakes (e.g., acknowledg- 
ing that their answer was not correct while encouraging them 
to use a hint). The success/failure condition provided stu- 
dents with a response if they answered a problem correctly 
and also after they made a second mistake. 
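
The three condition policies described above can be summarized in a simplified sketch. It is an assumption-laden illustration rather than the tutor's implementation: the function name `companion_response`, the event and action labels, the probability `p` (the text says only "a certain probability"), and the choice to evaluate the empathy rule at the start of each problem are all hypothetical. It also ignores the fact that students in every condition received some messages from all categories.

```python
import random

EMPATHY_SEQUENCE = ["mirror_negative_emotion", "empathy_message",
                    "connector", "growth_mindset_message"]

def companion_response(condition, event, last_valence=None, rng=None, p=0.5):
    """Return the companion's actions for one tutor event (hypothetical sketch).

    condition:    "empathy", "growth_mindset", or "success_failure"
    event:        "problem_start", "correct", or "second_incorrect"
    last_valence: "positive", "negative", or None (no self-report yet)
    p:            assumed probability of responding where the rule is probabilistic
    """
    rng = rng or random.Random()
    if condition == "empathy":
        if event != "problem_start" or last_valence is None or rng.random() >= p:
            return []
        if last_valence == "positive":
            return ["mirror_positive_emotion"]
        return list(EMPATHY_SEQUENCE)
    if condition == "growth_mindset":
        if event == "second_incorrect":
            return ["growth_mindset_message"]   # regardless of emotion
        if event == "problem_start" and rng.random() < p:
            return ["growth_mindset_message"]   # occasional message
        return []
    if condition == "success_failure":
        if event == "correct":
            return ["success_message"]
        if event == "second_incorrect":
            return ["failure_message"]
        return []
    raise ValueError(f"unknown condition: {condition}")

rng = random.Random(1)
print(companion_response("empathy", "problem_start", "negative", rng, p=1.0))
print(companion_response("growth_mindset", "second_incorrect", "positive", rng))
print(companion_response("success_failure", "correct"))
```

The sketch makes the key design difference visible: only the empathy policy conditions on the student's last reported emotion, so growth mindset content reaches only students who reported a negative state.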


3. RESULTS 

Of the 64 students, three students’ data were discarded due to 
minimal interaction with MathSpring. Across the N = 
61 students, 21,066 event log rows were recorded for three 
classes over four separate days, from which several behav- 
ioral features were derived and used throughout the analysis; 
our data and processing scripts can be found on GitHub [15]. 
All the students completed a pretest and posttest. Students 
in the empathy, growth mindset, and success/failure conditions 
received a total of 978, 763, and 882 messages, respectively. 
Means, standard deviations and percentage shares for each 
type of message are given in Table 3. It is important to 
note that students received messages from all categories but 
their condition emphasized the corresponding message type. 
For example, a student in the growth mindset condition received 
significantly more growth mindset messages than a student 
in the empathy condition. This distribution of messages means 
that different students saw different amounts of each type 
of message, which allows us to perform partial correlations 
with respect to the counts of each message type, separating 
their effects. 


Table 2: Examples of messages spoken by characters. 


Condition         Message 
Empathy           “Don’t you sometimes get frustrated trying to solve math problems? I do. But guess what. 
                  Keep in mind that when you are struggling with a new idea or skill you are learning 
                  something and becoming smarter.” 
Growth Mindset    “Hey, congratulations! Your effort paid off, you got it right.” 
                  “Did you know that when we practice to learn new math skills our brain grows and gets 
                  stronger?” 
                  “Let’s click on help, and I am sure we will learn something.” 
Success/Failure   “Very good, we got another one right.” 
                  “Hmm. Wrong. Shall we work it out on paper?” 


Figure 2: Time spent on a problem immediately before and after receiving the different categories of messages. 


3.1 Partial Correlations 

First, we attempted to replicate the results of our previous 
exploratory work [3]. For the three message types, partial 
correlations between the total number of messages of each type and 
the nine posttest measures were computed, controlling for the 
corresponding pretest measure, time spent in the tutor, and 
message frequency (total messages heard / time spent). 
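
A partial correlation of this kind can be computed with the standard recursive formula, which removes one control variable at a time. The sketch below is a generic illustration under that assumption; it is not the analysis script referenced in the paper, and the toy data are constructed (not study data) so that two variables correlate only through a shared control.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, controls):
    """Correlation of x and y after controlling for each sequence in `controls`.

    Uses the recursion
    r_xy.Zz = (r_xy.Z - r_xz.Z * r_yz.Z) / sqrt((1 - r_xz.Z^2) * (1 - r_yz.Z^2)).
    """
    if not controls:
        return pearson(x, y)
    *rest, z = controls
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Toy data: x and y each equal z plus noise that is uncorrelated with z
# and with each other, so their association comes entirely from z.
z = [1, 2, 3, 4, 5, 6]
x = [2, 1, 3, 4, 4, 7]
y = [2, 3, 1, 2, 6, 7]
print(round(pearson(x, y), 3))            # clearly positive raw correlation
print(round(partial_corr(x, y, [z]), 3))  # essentially zero once z is controlled
```

In the analysis described above, `x` would be a per-student message count, `y` a posttest measure, and `controls` the pretest measure, time in the tutor, and message frequency.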


Table 4 shows the result of this analysis. We observe that 
with exposure to more empathic messages, students exhib- 
ited higher levels of interest and valued math knowl- 
edge more (rows 1 and 7). Increased interest can be viewed 
as analogous to the high negative correlation with boredom 
reported in our earlier work. With growth mindset messages, 
students valued math knowledge less and had 
lower posttest performance scores (rows 7 and 9). 
With success/failure messages, students were less learning-oriented 
and claimed to be more confused (rows 6 and 3). 


To further understand the dynamics, we derived some in-tutor 
variables and performed the partial correlations shown in 
Table 5. The data for this analysis were derived as per-student 
metrics based on their interaction with MathSpring. 
We observed that students tend to answer significantly more 
questions when in the success/failure condition and end up 
making more mistakes as well (rows 4 and 5). It is important 
to note that they also avoid asking for hints (row 6). It 
seems like these students tend to rush through the problems 
while being more careless. They also make more mistakes 
when they receive more growth mindset messages (row 5). 
This leads to simpler questions, which they tend to get right 
on the first attempt (row 1). It appears that the students 
in the empathy condition continue to invest more time on 
solving problems than rushing through the problem set. The 
number of problems seen by these students is significantly 
less (row 4). 


As we see in Figure 2, students tend to spend less time 
on problems immediately after they receive growth mindset 
or success/failure messages. In contrast, the time spent on 
a problem increases slightly after receiving empathic messages. 
Students who received more empathic and growth 
mindset messages tend to answer fewer questions than do 
students who received mostly success/failure messages (Figure 3). 
Combined with the last plot, it looks like the students 
in the empathy condition continue to invest more time on 
solving problems than rushing through the problem set. 


Figure 3: Problems seen per minute across different 
pedagogies 


Table 3: The distribution of messages seen by students in each pedagogical condition. 


                         Empathy Messages      Growth Mindset Messages   Success/Failure Messages 
Condition          N     mean   std    %       mean   std    %           mean   std    % 
Empathy            [?]   7.48   [?]    16%     9.95   [?]    21%         29.1   22     62% 
Growth Mindset     [?]   0.2    0.5    0.5%    10     5      26%         27.9   19.2   73% 
Success/Failure    20    1.2    1.7    2.7%    4.6    4.8    10%         38.3   26.6   86% 

Table 4: Partial correlations between different types of messages seen and posttest variables (Table 1), 
accounting for the corresponding pretest value, time spent in tutor and message frequency. 


                                  Empathy Messages   Growth Mindset Messages   Success/Failure Messages 
Variable                          corr     p         corr     p                corr     p 
(1) Interest                      0.28     0.03      0.19     0.15             -0.20    0.14 
(2) Excitement                    0.00     1.00      -0.07    0.60             -0.08    0.54 
(3) Confusion                     -0.05    0.74      -0.05    0.74             0.32*    0.02 
(4) Frustration                   0.10     0.43      -0.08    0.54             -0.18    0.18 
(5) Performance Approach Goals    -0.19    0.14      -0.05    0.70             0.20     0.12 
(6) Learning Orientation          0.02     0.85      0.02     0.88             -0.24†   0.06 
(7) Math Value                    0.25*    0.05      -0.22†   0.09             -0.10    0.45 
(8) Math Liking                   0.01     0.96      0.01     0.96             0.05     0.72 
(9) Performance                   -0.01    0.93      -0.23†   0.07             -0.13    0.33 

† p < 0.10, * p < 0.05 


Table 5: Partial correlations between different types of messages seen and within-tutor variables, accounting 
for time spent in the tutor and message frequency. 


                                          Empathy Messages   Growth Mindset Messages   Success/Failure Messages 
Variable                                  corr     p         corr     p                corr     p 
(1) % Problems Solved on First Attempt    0.06     0.62      0.34     0.007            -0.01    0.94 
(2) Avg Problem [?]                       0.07     0.61      -0.05    0.69             0.19     0.14 
(3) Learning Gain                         -0.10    0.50      -0.07    0.63             -0.14    0.34 
(4) Problems Seen                         -0.23†   0.07      -0.04    0.78             0.77**   [?]E-13 
(5) Mistakes Made                         -0.01    0.91      0.59**   6E-7             0.30*    0.02 
(6) Hints Per Problem                     0.10     0.43      0.16     0.22             -0.22†   0.10 

† p < 0.10, * p < 0.05, ** p < 0.01 


3.2 Markov Chain Analysis 


As students solve problems in the tutoring system, the learn- 
ing companion comments on their attempts; the effect of 
these messages on student affect is sequential, but the par- 
tial correlations do not capture this. To analyze this effect, 
we built Markov Chain models using in-tutor student self- 
reports of confidence and frustration. Each model describes 
transitions in affective states, from one self-report to the 
next, where students received a particular type of character 
message (empathy, growth mindset, and success/failure) 
between self-reports. To reduce the state space, the 5-point 
scale used in the self-reports was simplified to two values - 
confident (> 3), not confident (< 3); similarly for frustra- 
tion. 
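
Estimating such a two-state model amounts to counting transitions between consecutive binarized self-reports. The sketch below illustrates the idea on hypothetical report sequences; it is not the authors' code. Two choices are assumptions of the sketch: reports exactly at the scale midpoint are treated as unclassified, and any transition touching an unclassified report is skipped. To build one model per message type, one would pass only the report pairs separated by that type of message.

```python
from collections import Counter

def binarize(report, midpoint=3):
    """Map a 5-point self-report to 'C' (confident) or 'N' (not confident).

    Reports equal to the midpoint return None (an assumption of this sketch).
    """
    if report > midpoint:
        return "C"
    if report < midpoint:
        return "N"
    return None

def transition_probs(report_sequences):
    """Maximum-likelihood transition probabilities from per-student sequences."""
    counts = Counter()
    for seq in report_sequences:
        states = [binarize(r) for r in seq]
        for a, b in zip(states, states[1:]):
            if a is not None and b is not None:
                counts[(a, b)] += 1
    probs = {}
    for a in "CN":
        row_total = counts[(a, "C")] + counts[(a, "N")]
        for b in "CN":
            probs[(a, b)] = counts[(a, b)] / row_total if row_total else 0.0
    return probs

# Hypothetical self-report sequences for three students.
sequences = [[4, 5, 2, 4], [2, 1, 4, 4, 5], [4, 3, 5]]
probs = transition_probs(sequences)
print(probs[("C", "C")], probs[("C", "N")])  # 0.75 0.25
print(probs[("N", "C")], probs[("N", "N")])  # two thirds, one third
```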


The goal of the Markov models was not to predict emotional 
changes, but rather to examine whether different messages 
had significant effects on affect. Markov models can show 
the probability of transitioning between affective states, but 
also have a stationary distribution, which represents the 
proportion of students that would be in each state after undergoing 
many transitions. For example, if a group of students 
were to use the system for many hours and receive 
only empathic messages, our model suggests that 99.5% of 
them would be confident about learning math (Figure 4). 
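
For a two-state chain the stationary distribution has a closed form: if p_CN is the probability of moving from Confident to Not Confident and p_NC the probability of the reverse move, the long-run share of confident students is p_NC / (p_CN + p_NC). The sketch below checks this against repeated application of the transition rule. The probabilities used are illustrative values only, not authoritative readings of Figure 4.

```python
def stationary_two_state(p_cn, p_nc):
    """Return (share in C, share in N) for a two-state Markov chain."""
    total = p_cn + p_nc
    if total == 0:
        raise ValueError("no transitions between states; every distribution is stationary")
    return p_nc / total, p_cn / total

def iterate_share_confident(p_cn, p_nc, start=0.5, steps=200):
    """Apply the transition rule repeatedly, starting from `start` share in C."""
    share = start
    for _ in range(steps):
        share = share * (1 - p_cn) + (1 - share) * p_nc
    return share

# Illustrative transition probabilities (assumed, for demonstration only).
p_cn, p_nc = 0.11, 0.31
pi_c, pi_n = stationary_two_state(p_cn, p_nc)
print(round(pi_c, 3), round(pi_n, 3))                     # 0.738 0.262
print(round(iterate_share_confident(p_cn, p_nc), 3))      # converges to the same share
```

The stationary share does not depend on the starting point, which is why it is a convenient single-number summary of each message model.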


Figure 4: State transitions between the Confident 
(C) and Not Confident (N) affective states. The sta- 
tionary distribution is shown below each state. Only 
the empathy model was significant in the likelihood 
ratio test (p < 0.05) 


[Figure 4 panels: After Empathic Message (stationary share confident: 99.5%); After Growth Mindset Message (73%); After Success/Failure Message (80%).] 


We used a likelihood ratio test to analyze the significance 
of these models: the likelihood of the data under the null model (ignoring 
message type) divided by the likelihood of the data 
under the alternate model (for a particular message type) gives 
a likelihood ratio, from which a p-value is obtained. Figure 4 shows the state transitions 
for confidence in the null model and the model for 
confidence after receiving empathic messages, which was 
significant with p = 0.0149 (the other models were not sig- 
nificant). We also examined the stationary distributions for 
each model (Table 6). 
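
One common way to turn such a likelihood ratio into a p-value is Wilks' chi-square approximation. The sketch below applies it to a two-state chain; this is an assumption about the procedure, since the text does not spell out how the p-value was derived. It further assumes that the null transition probabilities are treated as fixed and strictly positive, and that the message-specific model has two free parameters (one per row), giving 2 degrees of freedom. The counts and null probabilities are hypothetical.

```python
import math

def mle(counts):
    """Maximum-likelihood transition probabilities from transition counts."""
    probs = {}
    for a in "CN":
        row_total = counts.get((a, "C"), 0) + counts.get((a, "N"), 0)
        for b in "CN":
            probs[(a, b)] = counts.get((a, b), 0) / row_total if row_total else 0.0
    return probs

def log_likelihood(counts, probs):
    """Log-likelihood of observed transition counts under given probabilities."""
    return sum(n * math.log(probs[k]) for k, n in counts.items() if n > 0)

def likelihood_ratio_test(counts, null_probs):
    """Return (test statistic, p-value) using a chi-square with 2 df.

    For 2 degrees of freedom the chi-square survival function is exp(-x / 2).
    """
    alt_probs = mle(counts)
    stat = 2 * (log_likelihood(counts, alt_probs) - log_likelihood(counts, null_probs))
    return stat, math.exp(-stat / 2)

# Hypothetical transitions observed after one message type.
counts = {("C", "C"): 45, ("C", "N"): 5, ("N", "C"): 12, ("N", "N"): 8}
# Hypothetical null model pooled over all message types.
null_probs = {("C", "C"): 0.8, ("C", "N"): 0.2, ("N", "C"): 0.4, ("N", "N"): 0.6}

stat, p_value = likelihood_ratio_test(counts, null_probs)
print(round(stat, 2), round(p_value, 3))
```

A small p-value indicates that transitions after this message type are unlikely to have been generated by the pooled null model.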


Table 6: Stationary distributions in the Markov 
models of confidence and frustration. 


Message            Confidence          Frustration 
Type               Conf     Not        Frust    Not 
Empathy            99.5%*   0.5%*      35%      65% 
Growth Mindset     74%      26%        30%      70% 
Success/Failure    80%      20%        25%      75% 

* p < 0.05 


4. DISCUSSION 


Some of our results support the hypothesis that affective 
messages delivered by characters can positively impact stu- 
dents’ emotions and affective predispositions for math prob- 
lem solving. This is particularly evident for empathy, as 
the more empathic messages a student saw the higher their 
interest in mathematics problem solving, as well as their be- 
liefs that mathematics is valuable to learn (Table 4). An 
analysis of student behavior suggests that students who saw 
a high frequency of empathic messages also tended to be 
more patient and cautious with problem solving, suggesting 
that empathic messages may encourage students to persist 
through adversity. Exposure to empathic messages was sig- 
nificantly correlated to investing time in each math prob- 
lem activity, leading also to fewer problems seen per ses- 
sion. A positive trend is exhibited between high frequency 
of empathic messages and hints requested, even if not signif- 
icant (Table 5). Empirical temporal models generated from 
students’ changes in self-reports of affect, within the tutor, 
revealed that students receiving empathic messages have a 
higher likelihood to become more confident and to remain 
confident. 


The response to growth mindset messages delivered by char- 
acters yielded mixed results. As students saw more of these 
kinds of messages they also succeeded more often at solving 
problems correctly (on the first attempt); interestingly, at 
the same time, they also made more mistakes. This is also 
desirable, as growth mindset messages emphasize that mak- 
ing mistakes is okay and can even help learning, legitimizing 
a high frequency of errors. It is possible that students were 
using those mistakes and hints to learn and succeed later on; 
a (not significant) positive trend suggests that students re- 
ceiving more of these kinds of messages also asked for more 
hints per problem. In contrast, marginally significant effects 
suggest that a high frequency of growth mindset messages 
might be detrimental to students’ perception of math value, 
and that their posttest performance is worse when they receive 
more of these kinds of messages. It is hard to interpret 
these marginally significant effects, especially 
because a previous study suggested that these messages were 
beneficial in general [3]. Note that empathic messages also included 
growth mindset messages in order to resolve the negative 
emotion (see Table 2). One possible explanation is that 
the empathic condition was so positive because it was also 
very selective at showing growth mindset messages for only 
those who experienced negative emotions. It is likely that 
high achieving students, or those who “felt OK”, rejected 
growth mindset messages that they might have perceived to 
be unnecessary. 


Notably, we did not expect that success/failure 
messages could be so harmful to students. Regardless 
of whether messages indicated success or failure, as 
students received more of these messages they also exhibited 
lower levels of mastery/learning orientation at posttest time. 
They also reported higher levels of confusion at posttest time 
(note that confusion can be positive for learning within 
the learning experience, but not after the learning experience 
has concluded). Regarding behavior within the tutor, 
the more students were exposed to success/failure messages, 
the more they appeared to rush through problems, make 
mistakes, and request fewer hints per problem. 


To summarize, empathy messages were associated with variables 
consistent with methodical work and an increased interest in 
and valuing of mathematics. However, both growth mindset 
and success/failure messages appeared to be associated with 
a greater number of mistakes. Finally, success/failure mes- 
sages themselves were associated with a whole host of con- 
cerning behaviors such as confusion with the material follow- 
ing posttest, reduced learning orientation, hurried work, and 
a reduced likelihood of requesting hints. This is consistent 
with Dweck’s findings that growth mindset messages are superior 
to success/failure messages [8, 9]. Determining whether empathic 
messages in fact result in improved student performance from pretest 
to posttest will likely require larger samples than this small 
study (N = 61). However, students in non-empathic conditions 
made significantly more mistakes in their 
work. 


5. CONCLUSIONS 


This research emphasizes the importance of understanding 
an intervention’s effect on a student’s affective state, which 
in turn is connected to engagement, performance, and learn- 
ing. Although many researchers have focused on modeling 
affect, very little research effort has been put into systemat- 
ically measuring the impact of the intervention on the stu- 
dent behavior in an adaptive learning environment. Em- 
pathic messages that respond to students’ recent emotions 
have resulted in superior results both in improving the stu- 
dent interaction with the system and in the overall learning 
experience. Growth Mindset follows next with some pos- 
itive impact on in-tutor performance but its overall effect 
in the short-term is questionable. Success/Failure messages 
are generally harmful to students: reducing learning ori- 
entation, increasing confusion, and making students more 
careless during the learning experience. 


We conclude that affective messages delivered by charac- 
ters in online tutoring environments are a very important 
medium for building student-tutor rapport in a virtual envi- 
ronment, powerful signals that influence perceptions of stu- 
dents themselves as learners, as well as perceptions of the 
domain being taught. We have reported significant results 
that support the use of empathy to improve student affect 
and attitudes in a math tutor. The long-term effect of these 
messages needs to be studied when the novelty of this in- 
tervention wears off. In the future, we hope to study the 
impact of the frequency and content of these messages. To 
evaluate the generalizability of these results, student populations
across different demographics need to be studied, as
well as the applicability of the messages to domains beyond
mathematics.

ABSTRACT


This study investigates a possible way to analyze chat data from 
collaborative learning environments using epistemic network 
analysis and topic modeling. A 300-topic general topic model
built from the TASA (Touchstone Applied Science Associates) corpus
was used in this study. 300 topic scores were computed for each of
the 15,670 utterances in our chat data. Seven relevant topics
were selected based on the total document scores. While the aggregated
topic scores had some power in predicting students’
learning, using epistemic network analysis enables assessing the
data from a different angle. The results showed that the topic
score based epistemic networks of low gain students and
high gain students were significantly different (t = 2.00). Overall,
the results suggest these two analytical approaches provide com- 
plementary information and afford new insights into the processes 
related to successful collaborative interactions. 


Keywords 


chat; collaborative learning; topic modeling; epistemic network 
analysis 


1. INTRODUCTION


Collaborative learning is a special form of learning and interaction 
that affords opportunities for groups of students to combine cogni- 
tive resources and synchronously or asynchronously participate in 
tasks to accomplish shared learning goals [15; 20]. Collaborative 
learning groups can range from a pair of learners (called a dyad), 
to small groups (3-5 learners), to classroom learning (25-35 learn- 
ers), and more recently large-scale online learning environments 
with hundreds or even thousands of students [5; 22]. The collabo- 
rative process provides learners with a more efficient learning 
experience and improves learners’ collaborative learning skills, 
which are critical competencies for students [14]. Members in a 
team are different in many ways. They have their own experience, 
knowledge, skills, and approaches to learning. A student in a collaborative
learning environment can take in other students’ views
and ideas about the information provided in the learning material.
The ideas coming out of the team can then be integrated as a 
deeper understanding of the material, or a better solution to a 
problem. 


Traditional collaborative learning occurred in the form of face-to-face
group discussion or problem solving. As the internet and
learning technologies have developed, online collaborative learning
environments have emerged and are playing increasingly important
roles. For example, MOOCs (Massive Open Online Courses) have
drawn massive numbers of learners. Learners in MOOCs are connected
by the internet and can easily interact with each other using
various types of tools, such as forums, blogs and social networks 
[23]. These digitized environments make it possible to track the 
learning processes in collaborative learning environments in 
greater detail. 


Communication is one of the main factors that differentiates col- 
laborative learning from individual learning [4; 6; 9]. As such, 
chats from collaborative learning environments provide rich data 
that contains information about the dynamics in a learning pro- 
cess. Understanding massive chat data from collaborative learning 
environments is interesting and challenging. Many tools have 
been invented and used in chat data analysis, such as LIWC (lin- 
guistic inquiry and word count) [12], Coh-Metrix [10], and topic 
modeling, just to name a few. Epistemic network analysis (ENA) 
has been playing a unique role in analyzing chat data from epis- 
temic games [18]. ENA is rooted in a specific theory of learning: 
the epistemic frame theory, in which the collection of skill, 
knowledge, identity, value and epistemology (SKIVE) forms an 
epistemic frame. A critical theoretical assumption of ENA is that 
the connections between the elements of epistemic frames are 
critical for learning, not their presence in isolation. The online 
ENA toolkit allows users to analyze chat data by comparing the 
connections within the epistemic networks derived from chats. 
ENA visualization displays the clustering of learners and groups 
and the network connections of individual learners and groups. 
ENA requires coded data, which has traditionally come from hand-coded
data sets or from classifiers based on regular expression mapping.
Combining topic modeling with ENA will provide a new
mode of preparing data sets for analysis using ENA.


In this study, we used a combination of topic modeling and ENA 
to analyze chat data to see if we could detect differences between 
the connections made by students with high learning gains versus 
students with low learning gains. Incorporating topic modeling
with ENA will make the analytic tool more fully automated and of
greater use to the research community.


2. RELATED WORK 


Chats have two obvious features. First, they appear in the form of 
text. Therefore, any text analysis tool may have a role in chat 
analysis. Second, chats come from individuals’ interaction, which 
reflects social dynamics between participants. Therefore, a com- 
bination of text analysis and social network analysis should be 
helpful in understanding underlying chat dynamics. For instance, 
Tuulos et al. [21] combined topic modeling with social network 
analysis in chat data analysis. They found that topic modeling can
help identify the receiver of a chat (the person to whom the chat is
addressed).


In a similar effort, Scholand et al. [16] combined LIWC and social 
network analysis to form a method called “social language net- 
work analysis” (SLNA). The social networks were formed by 
counting the number of times chat occurred between any two 
participants. Based on the counts, participants were clustered into 
a tree structure, representing the level of subgroups the partici- 
pants belong to. LIWC was then used to get the text features of 
chats. It was found that some LIWC features were significantly
different between in-group conversations and out-of-group conversations.


Researchers have also recently explored the advantages of com- 
bining SNA (social network analysis) with deeper level computa- 
tional linguistic tools, like Coh-Metrix. Coh-Metrix computes 
over 100 text features. The five most important Coh-Metrix fea- 
tures are: narrativity, syntax simplicity, word concreteness, refer- 
ential cohesion and deep cohesion. Dowell and colleagues [8] 
explored the extent to which characteristics of discourse diagnostically
reveal learners’ performance and social position in
MOOCs. They found that learners who performed significantly 
better engaged in more expository style discourse, with surface 
and deep level cohesive integration, abstract language, and simple 
syntactic structures. However, linguistic profiles of the centrally 
positioned learners differed from the high performers. Learners 
with a more significant and central position in their social network 
engaged using a more narrative style discourse with less overlap 
between words and ideas, simpler syntactic structures and abstract 
words. A methodological contribution of this work is that it
highlights how automated linguistic analysis of student interactions
can complement social network analysis (SNA) techniques
by adding rich contextual information to the structural patterns of
learner interactions.


In another study, Dowell et al. [7] showed that students’ linguistic 
characteristics, namely higher degrees of narrativity and deep 
cohesion, are predictive of their learning. That is, students en- 
gaged in deep cohesive interactions performed better. 


In the present research, we explore collaborative interaction chat 
data using the combination of topic modeling and epistemic network
analysis. While previous studies focused on the relationship
between language features and social network connections, our
study focuses on predicting learning performance from the semantic
network connections students make in chats.


3. METHODS 


Participants. Participants were enrolled in an introductory-level 
psychology course taught in the Fall semester of 2011 at a large 
university in the USA. While 854 students participated in this 
course, some minor data loss occurred after removing outliers and 
those who failed to complete the outcome measures. The final 
sample consisted of 844 students. Females made up 64.3% of this
final sample. Within the population, 50.5% of the sample identified
as Caucasian, 22.2% as Hispanic/Latino, 15.4% as Asian
American, 4.4% as African American, and less than 1% identified 
as either Native American or Pacific Islander. 


Course Details and Procedure. Students were told that they 
would be participating in an assignment that involved a collabora- 
tive discussion on personality disorders and taking quizzes. Stu- 
dents were told that their assignment was to log into an online 
educational platform specific to the University at a specified time, 
where they would take quizzes and interact via web chat with one 
to four random group members. Students were also instructed 
that, prior to logging onto the educational platform, they would 
have to read material on personality disorders. After logging into 
the system, students took a 10 item, multiple choice pretest quiz. 
This quiz asked students to apply their knowledge of personality 
disorders to various scenarios and to draw conclusions based on 
the nature of the disorders. The following is an example of the 
types of quiz questions students were exposed to: 


• Jacob was diagnosed with narcissistic personality disorder.
  Why might Dr. Simon think this was the wrong
  diagnosis?

• Dr. Level has measured and described his 10 mice of
  varying ages in terms of their length (cm) and weight
  (g). How might he describe them on these characteristics
  using a dimensional approach?

• Danielle checks her Facebook page every hour. Does
  Danielle have narcissistic personality disorder?


After completing the quiz, they were randomly assigned to other 
students who were waiting to engage in the chatroom portion of 
the task. When there were at least 2 students and no more than 5 
students (M = 4.59), individuals were directed to an instant mes- 
saging platform that was built into the educational platform. The 
group chat began as soon as someone typed the first message and 
lasted for 20 minutes. The chat window closed automatically after 
20 minutes, at which time students took a second 10-item
multiple-choice quiz. Each student contributed 154.0 words on
average (SD = 104.9) in 19.5 sentences (SD = 12.5). As a group, 
discussions were about 714.8 words long (SD = 235.7) and 90.6 
sentences long (SD = 33.5). 


An excerpt of a collaborative interaction chat in a chat room is 
shown below in Table 1 (student names have been changed):


Table 1. An excerpt of a collaborative interaction chat 


Student | Chat Text
Art | ok cool, everyone's here. sooo first question
Art | ok so the certain characteristics to be considered to have a personality disorder?
Shaffer | Alright sooo first question: Based on these criteria describe several reasons why a psychologist might not label someone with grandiose thoughts as having narcissistic personality disorder?
Shaffer | hahaha never mind
Shaffer | that was the second question.
Art | lol its all good
Shaffer | okay so certain characteristics: doesn't it have to be like a stable thing?
Carl | I think the main thing about having a disorder is that its disruptive socially and/or makes the person a danger to himself or others
Vasile | yes, stable over time
Shaffer | yeah, and it also mentioned it can't be because of drugs
Art | also they have to have like unrealistic fantasies
Nia | yeah and not normal in their culture
Carl | no drugs or physical injury
Vasile | begins in early adulthood or adolescence
Shaffer | i think that covers them? haha
Art | ok, so arrogance doesn't just define it, they have to have most of these characteristics
Art | yeah I think we got them
Shaffer | is it most or is it like 6?


From the above excerpt, we can see several obvious things. First, 
the lengths of the utterances varied from one single word to mul- 
tiple sentences. This needs to be considered in text analysis be- 
cause some methods work only for longer texts. For example, 
Coh-Metrix usually works well for texts with more than 200 
words. Topic modeling also needs enough length to reliably infer 
topic scores. Second, the number of utterances each participant
gave differed. From how much and what a member said, we
can see each member played a different role in that chat. Third, 
the ordered sequence of the utterances forms a time series. Under- 
standing and visualizing the underlying discourse dynamics are 
important for meaning making with this type of data. 


The data set contained 15,670 utterances, pretest scores (the first
quiz) and post-test scores (the second quiz) for 844 students,
grouped in 182 chat rooms. Each chat room had 2 to 5 students,
4.73 on average. The average number of speech turns per student was
18.2 and the average number of speech turns per room was 86.1.


The average pretest score was 36.01% correct and the average
post-test score was 45.73% correct. A paired-sample t-test shows that the
post-test score is significantly higher (t = 14.13, N = 844). We computed
the learning gain of each student, using the formula

\[ \text{gain} = \frac{\text{posttest score} - \text{pretest score}}{1 - \text{pretest score}} \]

For all students (N = 844), the average learning gain is 0.11:
59.5% had positive learning gains above 0.1, 16.5% had the same
scores, and 23% had negative learning gains. Not surprisingly,
students who had lower pretest scores had higher learning gains
because they had greater potential to learn. Figure 1 shows the
average learning gain as a function of pretest score.
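As a small illustration of the learning gain formula, the following sketch computes the normalized gain from scores expressed as proportions correct (0 to 1). The function name `learning_gain` and the error raised for a perfect pretest are assumptions made for this example; they are not taken from the original analysis.

```python
def learning_gain(pretest: float, posttest: float) -> float:
    """Normalized gain: (posttest - pretest) / (1 - pretest).

    Both scores are proportions correct in [0, 1]. The gain is undefined
    when the pretest is already perfect, so that case raises an error
    (a choice made for this sketch).
    """
    if pretest >= 1.0:
        raise ValueError("gain is undefined when the pretest score is 1.0")
    return (posttest - pretest) / (1.0 - pretest)
```

For instance, a student who moves from 40% to 70% correct has closed half of the remaining gap, so the gain is about 0.5, while a student whose score drops receives a negative gain.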


Figure 1. Average learning gain as a function of pretest score. 


For students with pretest scores less than 50% correct (N=624), 
the average learning gain is 0.88, 69.7% had positive learning 
gains, 15.7% had the same scores and 14.6% had negative learn- 
ing gains. 


This data set has been analyzed in multiple studies. Cade et al. [3] 
analyzed the cohesion of the chats and found that deep cohesion
of the chats predicts the students’ feelings of power and connectedness
to the group. Dowell et al. [7] found that some Coh-Metrix
measures predict learning. Coh-Metrix measures describe common
textual features that are not content specific. For example,
cohesion is about how text segments are semantically linked to 
each other, which has nothing to do with what the text content is 
about. In this study, we use topic modeling to provide content 
dependent features and use epistemic network analysis to explore 
how the topics were associated in the chats. 


4. TOPIC MODELING 


Topic modeling has been widely used in text analysis to find what 
topics are in a text and what proportion/amount of each topic is 
contained. Latent Dirichlet Allocation (LDA) [2; 24] is one of the 
most popular methods for topic modeling. LDA uses a generative 
process to find topic representations. LDA starts from a large 
document set $D = \{d_1, d_2, \ldots, d_N\}$. A word list
$W = \{w_1, w_2, \ldots, w_M\}$ is then extracted from the document set. LDA
assumes that the document set contains a certain number of topics, 
say, K topics. Each document has a probability distribution over 
the K topics and each topic has a probability distribution over the 
given list of words. When a document was composed, each word 
that occurred in a document was assumed to be drawn based on 
the document-topic probability and the topic-word probability. 
For a given corpus (document set) and a given number of topics 
K, LDA can compute the topic assignment of each word in each 
document. 
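The generative assumption described above can be sketched in a few lines. This is only an illustration of the sampling story (draw a topic for each word position, then draw a word from that topic); it is not an LDA inference procedure, and it takes the document-topic and topic-word distributions as given rather than drawing them from Dirichlet priors. The function name `generate_document` and its parameters are hypothetical.

```python
import random


def generate_document(doc_topic_probs, topic_word_probs, length, rng=None):
    """Sample `length` (word, topic) pairs following the generative story.

    doc_topic_probs: list of K probabilities (the document's topic distribution).
    topic_word_probs: list of K dicts mapping word -> probability within a topic.
    """
    rng = rng or random.Random()
    topic_ids = list(range(len(doc_topic_probs)))
    sampled = []
    for _ in range(length):
        # First draw a topic from the document-topic distribution.
        k = rng.choices(topic_ids, weights=doc_topic_probs)[0]
        # Then draw a word from the chosen topic's word distribution.
        words = list(topic_word_probs[k])
        weights = [topic_word_probs[k][w] for w in words]
        sampled.append((rng.choices(words, weights=weights)[0], k))
    return sampled
```

LDA inference runs in the opposite direction: given only the observed words, it estimates which topic assignment most plausibly produced each word.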


For a given topic, the word probability distribution can be easily 
computed from the number of times each word was assigned to 
the given topic. The beauty of topic modeling is that the “top 
words” (words with highest probabilities in a topic) usually give a 
meaningful interpretation of a topic. The distributions are the 
underlying representation of the topics. The top words are usually 
used to show what topics are contained in the corpus. 
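A minimal sketch of this counting step is shown below, assuming the topic assignments are available as (word, topic id) pairs. The helper names are hypothetical, and the sketch omits the smoothing priors that LDA implementations commonly add to the counts.

```python
from collections import Counter


def topic_word_distribution(assignments, topic):
    """Estimate P(word | topic) from (word, topic_id) assignment pairs."""
    counts = Counter(word for word, t in assignments if t == topic)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def top_words(distribution, n=20):
    """Return the n most probable words, breaking ties alphabetically."""
    ranked = sorted(distribution.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]
```

With the assignments `[("health", 0), ("health", 0), ("disease", 0), ("cell", 1)]`, topic 0 gives "health" a probability of 2/3 and lists it as the top word.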


By counting the number of words assigned to each topic, a topic 
proportion score can be computed for each document on each 
topic. The topic proportion scores then become a document fea- 
ture that can be used in further analysis. However, the proportion 
scores are based on the statistical topic assignment of words. 
When documents are very short, such as most utterances in our 
chat data, the topic proportion scores won’t be reliable. Cai et al. 
[4] argued that alternative ways to compute document topic scores 
are possible. 


4.1 TASA Topic Model 


Although our chat data set contained 15,670 utterances, the utter- 
ances were short and the corpus is not large enough to build a 
reliable topic model. To get a reliable model, we used a well-known
corpus provided by TASA (Touchstone Applied Science
Associates). This corpus contains documents in seven known
categories: business, health, home economics, industrial
arts, language arts, science and social studies. Our content topic,
personality disorders, is obviously in the health category. Of
course, not all topics in TASA are relevant to our study. Therefore,
after building the model, we needed to select relevant topics.
We cover that in the next subsection.

There are a total of 37,651 documents in the TASA corpus, each of
which is about 250 words long. Before we ran LDA, we filtered
out very high frequency words and very low frequency words.
High frequency words, such as “the”, “of”, “in”, etc., do not contain
much topic information. Rare words do not contribute to
meaningful statistics. After filtering, 28,483 words (it might be
better to say “terms”) were left. A model with 300 topics was
constructed by LDA.


4.2 Topic score computation and topic selection

From the TASA topic model, we computed the word-topic proba- 
bilities based on the number of times a word was assigned to each 
of the 300 topics. Thus, each word is represented by a 300-dimensional
probability distribution vector. For each chat in our chat
corpus, we simply summed up the word probability vectors for the
words that appeared in the chat. That gave us 300 topic scores for
each chat. Recall that the chats were associated with a reading
material and two quizzes. While the students were free to talk 
about anything, the content of the reading material and the quizzes 
set up the main chat topics, that is, personality disorders. 
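The summation described above can be sketched as follows. The dictionary `word_topic_probs` (word to per-topic probability vector) and the pre-tokenized input are assumptions made for the illustration; words that are not in the model, such as filtered-out high frequency words, are simply skipped.

```python
def chat_topic_scores(tokens, word_topic_probs, num_topics):
    """Sum the per-word topic probability vectors over the words of one chat.

    tokens: list of words in the chat (tokenization is assumed to be done).
    word_topic_probs: dict mapping word -> list of num_topics probabilities.
    """
    scores = [0.0] * num_topics
    for token in tokens:
        vector = word_topic_probs.get(token)
        if vector is None:
            # Words outside the model vocabulary contribute nothing.
            continue
        for k, prob in enumerate(vector):
            scores[k] += prob
    return scores
```

Because the vectors are summed rather than averaged, longer chats receive larger scores on every topic, which matters for the correlation analysis later in this section.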




Figure 2. Sorted topic scores for topic selection. 


The first thing we needed to do then was to investigate whether or 
not the “hot” topics from the computation made sense. To find 
that out, we computed the sum of all topic scores over all chats. 
The topics were sorted according to the total topic score. The hottest
topic had a total score higher than 1300, much higher than the
second highest (less than 900). Judging by its top words, this
topic is about “illness”, which is highly relevant to personality
disorders. Six hot topics scored in the range from 600 to 900.
They are about “outdoors”, “biology”, “psychology”, “people/social”,
“education” and “healthcare”. The top words are listed below.


• Illness: health, disease, patient, body, diseases, medical,
  stress, mental, physical, heart, doctor, problems, cause,
  person, patients, exercise, illness, problem, nurse,
  healthy

• Outdoors: dog, energy, plants, earth, car, light, food,
  heat, words, animals, music, rock, language, children,
  air, uncle, city, sun, women, plant

• Biology: cells, cell, genes, chromosomes, traits, color,
  organisms, sex, egg, species, gene, body, male, female,
  parents, nucleus, eggs, sperm, organism, sexual

• Psychology: behavior, learning, theory, environment,
  feelings, sexual, physical, social, sex, human, research,
  person, animal, mental, response, positive, stress,
  personality, subject, reaction

• People/Social: joe, pete, mr, charlie, dad, frank, billy,
  tony, jerry, 'll, mom, 'd, going, 're, got, boys, looked,
  asked, paper, go

• Education: students, teacher, teachers, child, children,
  student, school, education, schools, learning, parents,
  tests, test, program, teaching, behavior, skills, reading,
  team, information

• Healthcare: patient, doctor, health, hospital, medical,
  dr, patients, nurse, disease, doctors, team, care, office,
  nursing, drugs, medicine, services, dental, diseases, help
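The selection step that produced this list (summing each topic's score over all chats and sorting the totals) can be sketched as below. The function name `rank_topics` is hypothetical, and the final choice of seven topics in the study also involved inspecting the top words, which the code does not attempt.

```python
def rank_topics(chat_scores):
    """Rank topics by their total score over all chats.

    chat_scores: list of per-chat score lists, all of the same length.
    Returns (topic_index, total_score) pairs sorted from highest to lowest.
    """
    totals = [sum(column) for column in zip(*chat_scores)]
    return sorted(enumerate(totals), key=lambda pair: -pair[1])
```

Taking the first few entries of the returned list gives the candidate "hot" topics to inspect.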




“Illness”, “biology”, “psychology” and “healthcare” are the topics
covered by the learning materials. The “education” topic is about the
education environment where the chat happened. “Outdoors” and
“people/social” are off-task topics.


To get an idea about whether or not the topic scores were related 
to the learning gain, we aggregated the scores by person and com- 
puted the correlation between the total topic score and the learning 
gain for each topic. We were only interested in looking at the 
students with larger potential to learn, so we removed the data 
with pretest score greater than or equal to 0.5, leaving 624 students
out of 844. The results (Table 2) showed that all topics were
significantly correlated with learning gain. This result is less
encouraging than it looks, because it seems to suggest that, whatever
topic a student talked about, the more the student talked, the larger
the gain the student obtained. The real reason is that in the
aggregation, all topic scores were summed up. Therefore, all topic
scores were influenced by the chat length. So the correlations in
Table 2 basically show the chat length effect.


Table 2. Correlation between total topic scores and learning 
gain (N=624, pretest<0.5) 


Topic | Post-test | Pretest | Gain
Illness | [illegible] | [illegible] | [illegible]
Outdoors | .216** | [illegible] | .154**
Biology | .159** | .125** | .105**
Psychology | .182** | .096* | .140**
People/Social | .115** | .022 | [illegible]
Education | [illegible] | .118** | [illegible]
Healthcare | [illegible] | .130** | .097*


To remove the chat length effect, the simplest way is to divide all 
scores by the number of words (terms) in each chat. However, in 
this study, to be consistent with subsequent analysis, we normal- 
ized the topic scores to topic proportion scores by dividing each 
topic score for each utterance by the sum of all seven topic scores 
of the same utterance. 
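A minimal sketch of this normalization is given below. The function name `to_proportions` and the handling of an utterance whose seven scores are all zero are assumptions for the example.

```python
def to_proportions(topic_scores):
    """Divide each selected topic score by the sum of all selected scores."""
    total = sum(topic_scores)
    if total == 0:
        # An utterance with no score on any selected topic stays all zero
        # (a choice made for this sketch).
        return [0.0] * len(topic_scores)
    return [score / total for score in topic_scores]
```

Scaling every score of an utterance by the same factor, as a longer utterance with the same topic mix would roughly do, leaves the proportions unchanged, which is why this step removes the chat length effect.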


The results (Table 3) showed that the topic “people/social” had a 
significant negative correlation to learning gain. Others were not 
significant but were in the direction we would expect. “Illness”, 
“biology”, “psychology” and “healthcare” were positively correlated
with gain scores, while “outdoors” and “people/social” topics
were negatively correlated with gain scores. We observed
almost no correlation for the “education” topic. This seems to
indicate that the aggregated topic scores have limited power in
predicting learning. Therefore, we used ENA to examine the connections
or associations of these topics in the students’ discourse to
develop a predictive model of learning gains based on the use of
these topics.


Table 3. Correlation between normalized topic proportion 
scores and learning gain (N=624, pretest<0.5) 


Topic | Post-test | Pretest | Gain
Illness | .099* | .077 | .067
Outdoors | -.063 | -.043 | -.044
Biology | .085* | .054 | .063
Psychology | .067 | .019 | .058
People/Social | [illegible] | -.076 | -.083*
Education | .027 | .056 | -.002
Healthcare | .073 | .096* | .027


5. EPISTEMIC NETWORK ANALYSIS 


ENA measures the connections between elements in data and 
represents them in dynamic network models. ENA creates these 
network models in a metric space that enables the comparison of 
networks in terms of (a) a difference graph that highlights how the
weighted connections of one network differ from another; and (b) 
statistics that summarize the weighted structure of network con- 
nections, enabling comparisons of many networks at once. 


ENA was originally developed to model cognitive networks in- 
volved in complex thinking. These cognitive networks represent 
associations between the knowledge, skills, and habits of mind of individual
learners or groups of learners. In this study, we used ENA to
construct network models. For each individual student, we con- 
structed an ENA network using the selected seven topic scores for 
each utterance the student contributed to the group. 


5.1 Process 

While the process of creating ENA models is described in more 
detail elsewhere (e.g. [11; 17-19]), we will briefly describe how 
ENA models are created based on topic modeling. Here we de- 
fined network nodes as the seven topics identified from the topic 
model. We defined the connections between nodes, or edges, as 
the strength of the co-occurrence of topics within a moving stanza 
window (MSW) of size 5 [19]. To model connections between 
topics we used the products of the topic scores summed across all 
chats in the MSW. That is, for each topic, the topic scores are 
summed across all 5 chats in the MSW. Then ENA computed the 
product of the summed topic loadings for each pair of topics to
measure the strength of their co-occurrence. For example, if the
sum of the topic scores across five chats was 0.5 for “illness”, 0.3
for “psychology”, and 0.2 for “healthcare”, these scores would 
result in three co-occurrences, “illness-psychology”, “illness- 
healthcare”, and “psychology-healthcare”, with scores of 0.15, 
0.1, and 0.06, respectively. 
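The co-occurrence computation for one window can be sketched as follows, reproducing the worked example above. The function names are hypothetical, and the way `moving_windows` slides (each chat together with up to four preceding chats) is an assumption about the window definition rather than a description of the ENA toolkit's implementation.

```python
from itertools import combinations


def window_cooccurrence(window, topics):
    """Sum each topic's score over the chats in a window, then multiply
    the sums for every pair of topics.

    window: list of dicts mapping topic name -> score, one dict per chat.
    """
    summed = {t: sum(chat.get(t, 0.0) for chat in window) for t in topics}
    return {(a, b): summed[a] * summed[b] for a, b in combinations(topics, 2)}


def moving_windows(chats, size=5):
    """Yield, for each chat, that chat plus up to size - 1 preceding chats
    (an assumed reading of the moving stanza window)."""
    for end in range(1, len(chats) + 1):
        yield chats[max(0, end - size):end]
```

With five chats whose scores sum to 0.5, 0.3, and 0.2 for the three topics, `window_cooccurrence` returns approximately 0.15, 0.10, and 0.06 for the three pairs, matching the example in the text.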


Next ENA created adjacency matrices for each student that quan- 
tified the co-occurrences of topics within the students’ discourse 
in the context of their chat group. The adjacency
matrices were then treated as vectors in a high-dimensional space,
where each dimension corresponds to the co-occurrence of a pair of
topics. The vectors were then normalized to unit vectors. Notice
that the normalization removed the effect of chat length embedded
in the topic scores. A singular value decomposition (SVD) was
then performed for dimensional reduction. ENA then projected a
vector for each student into a low-dimensional space that maximizes
the variance explained in the data. Finally, the nodes of the
networks, which in this case correspond to the seven selected
topics generated from the TASA corpus, were placed in the
low-dimensional space. The topic nodes were placed using an optimization
algorithm such that the overall distance between the centroids
(centers of mass of the networks) and the corresponding projected
student locations was minimized. A critical feature of ENA
is that these node placements are fixed, that is, the nodes of each
network are in the same place for all units in the analysis. This
fixing of the location of the nodes allows for meaningful comparisons
between networks in terms of their connection patterns,
which allows us to interpret the metric space. As a result, ENA
produced two coordinated representations: (1) the location of each
student in a projected metric space, in which all units of analysis
included in the model were located, and (2) weighted network
graphs for each student, which explained why the student was
positioned where he or she was in the space.
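The unit normalization step, and the reason it removes the chat length effect, can be illustrated with a short sketch. The function name `unit_normalize` is hypothetical, and the SVD projection and node placement steps are not shown here.

```python
import math


def unit_normalize(vector):
    """Scale a co-occurrence vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        # A student with no co-occurrences keeps an all-zero vector
        # (a choice made for this sketch).
        return list(vector)
    return [x / norm for x in vector]
```

Two students with the same pattern of topic connections but different amounts of talk, for example co-occurrence vectors `[3.0, 4.0]` and `[12.0, 16.0]`, map to the same unit vector, so only the relative strengths of the connections are compared.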


ENA also allows us to compare the mean network graphs and 
mean position in ENA space between different groups of students.
In this study, we only considered the students with high
potential to learn, i.e., the 624 students with pretest score < 0.5
(50% correct). Among these students, we compared the networks
of low learning gain students (gain<-0.1, N=194) with the networks
of high learning gain students (gain>0.43, N=105). We
compared these groups using a difference network graph, which
was formed by subtracting the edge weights of the mean discourse
network for the low gain group from those of the mean discourse
network for the high gain group. This difference network graph
shows us which topic connections are stronger for each group. In
addition, we conducted a t-test to test the difference between
group means.
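The group comparison can be sketched as below. Networks are represented as dicts from topic pairs to edge weights, and all function names are hypothetical. The text does not state which form of t-test was used; the unequal-variance (Welch) statistic shown here is an assumption chosen for the illustration.

```python
import math
from statistics import mean, variance


def mean_network(networks):
    """Average the edge weights over a group of student networks."""
    edges = networks[0].keys()
    return {edge: mean(net[edge] for net in networks) for edge in edges}


def difference_network(high_group, low_group):
    """Mean high gain network minus mean low gain network, edge by edge.
    Positive values mark connections that are stronger in the high gain group."""
    high, low = mean_network(high_group), mean_network(low_group)
    return {edge: high[edge] - low[edge] for edge in high}


def welch_t(a, b):
    """Welch's t statistic for two samples (each needs at least two values)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error
```

In the study's setting, `a` and `b` would hold the x-coordinates of the projected centroids for the high gain and low gain students, respectively.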


5.2 Results 


Figure 3 shows mean discourse networks for students with low 
gain scores (left, red), students with high gain scores (right, blue), 
and a difference network graph (center) that shows how the discourse
patterns of each group differ. Students with low gains had
stronger connections between the “people/social” topic and all 
other topics except for “illness”. More importantly, the connec- 
tion that was the strongest for low gain students compared to high 
gain students was between “people/social” and “outdoors”. Stu- 
dents with high gain scores made stronger connections between 
the topics of “illness”, “psychology”, “healthcare”, “biology”, and 
“education”. 


Table 4. Comparison of centroids between low gain and high
gain students, p = 0.047, t = 2.00


Group | Mean | SD
High gain | 0.033 | 0.220
Low gain | -0.048 | 0.322


Figure 4 shows centroids, or the centers of mass, of individual 
students’ discourse networks and their means with low gain score 
students in red and high gain score students in blue. The differences
between these two groups were significant on the x dimension
(see Table 4). This means that the differences we saw in
Figure 3 and described above are statistically significant. In other
words, the high learning gain students’ discourse was more towards
the right side of the ENA space and the low learning gain
students’ discourse was more towards the left side. That indicates
that the discourse of students with high learning gains made more
connections between on-task topics (“illness”, “psychology”,
“healthcare”, “biology”, and “education”), while the discourse of
low gain students made more connections between off-task topics
(“people/social” and “outdoors”).


6. DISCUSSION 


ENA makes it possible to visualize the chat dynamics to help 
researchers gain deeper understanding of what is going on in a 
collaborative learning environment. Differences in what topics 
students connect in discourse can predict learning outcomes. Previous
uses of ENA have relied on human-coded data or on regular
expressions to classify data. Utilizing topic modeling can lead
to fully automated ENA, making it accessible to a wider
group of researchers and allowing ENA to be used with more and
larger data sets.


The fact that the epistemic network predicts learning supports
further application of ENA. For example, the turn-by-turn chat
dynamics can be plotted as trajectories in the 2-D space where the
topics are placed. Investigating the trajectory patterns and their
relationship to learning or socio-affective components is an
interesting future research direction.


We used a general topic model in this study. Many studies in the
literature have used LDA for topic modeling on relatively small
corpora. This causes two problems. 1) LDA topic models built upon
small corpora are not reliable, because LDA requires a large number
of documents, each of relatively large size. An inadequate
corpus can produce misleading results. 2) Using a topic
model that is not shared across studies leads to arbitrary interpretation.
For example, the representations of “illness” learned from different
corpora could be very different. Therefore, it is hard to compare claims
made about “illness” across different studies. Using a reliable,
common topic model will set up a common language for different
studies.


Figure 3: Mean discourse networks for students with low gain scores (left, red), students with high gain scores (right, blue), and a
difference network graph (center).


Figure 4: Discourse network centroids, with low gain score students
in red and high gain score students in blue.


Topic scores for documents are usually inferred from topic models.
While for longer documents the topic scores can be used in
many applications (e.g., text clustering [1]), the inferred topic
proportion scores are not useful for analyzing chats if we need
to treat each utterance as a unit of analysis, because
chat utterances are too short. The statistical inference algorithm
involves a high degree of randomness for short documents. As an
extreme example, an utterance with a single word would result in
inferred topic proportion scores with “1” on one topic and “0” on
the others. The problem is that this “1” is assigned to a topic with
a certain degree of uncertainty. That is, the topic this “1” is
assigned to could be any topic. While aggregated analysis may not
be sensitive to such uncertainty, detailed utterance-by-utterance
analysis would suffer from it.


Our method of computing topic scores is based on the topic prob- 
ability distribution over each word. We treat the topic distribution 
of each word as a vector. When computing the topic score, the 
simple sum of all word vectors gives scores to all topics. As we 
have pointed out, the summation algorithm will have a length 
effect. Therefore, when such topic scores are used, removing 
length effects through normalization is necessary. In this article,
we did not use a weighted sum as suggested in Cai et al. [4]. Comparing
the effects of different weightings is beyond the scope of this
paper.
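
The summation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the word-topic table `word_topic`, its toy values, and the function names are hypothetical, and dividing by the number of matched words is just one possible way to remove the length effect (the exact normalization is not specified here).

```python
from typing import Dict, List, Sequence, Tuple


def topic_scores(tokens: Sequence[str],
                 word_topic: Dict[str, List[float]],
                 n_topics: int) -> Tuple[List[float], int]:
    """Sum the topic-distribution vectors of all known words in an utterance.

    Returns the raw (length-dependent) topic scores and the number of words
    that were found in the word-topic table.
    """
    scores = [0.0] * n_topics
    n_known = 0
    for tok in tokens:
        vec = word_topic.get(tok)
        if vec is None:          # word not covered by the topic model
            continue
        n_known += 1
        for k, p in enumerate(vec):
            scores[k] += p
    return scores, n_known


def normalize_by_length(scores: List[float], n_known: int) -> List[float]:
    """Assumed normalization: divide by the number of matched words."""
    if n_known == 0:
        return list(scores)
    return [s / n_known for s in scores]


# Hypothetical two-topic table: P(topic | word) for each word.
word_topic = {"flu": [0.9, 0.1], "doctor": [0.6, 0.4]}
raw, n = topic_scores(["flu", "doctor", "the"], word_topic, n_topics=2)
normalized = normalize_by_length(raw, n)
```

Because every word contributes a full probability vector, even a one-word utterance receives graded scores on all topics rather than a single "1".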

When a general topic model is used, selecting topics relevant to 
the specific analysis becomes important. Our approach was to 
look at the total scores of utterances and find the “hot” topics by 
sorting the total topic scores. In our study, we had a quickly de- 
creasing curve that helped us to select topics. We believe this 
would be the case for most studies using a model containing far 
more topics than the topics contained in the target data. 
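
The topic-selection step just described can be sketched as below. The function name, the toy scores, and the fixed cut-off `top_k` are assumptions made for illustration; in the study the cut-off was chosen by inspecting the decreasing curve of total scores, not by a fixed number.

```python
from typing import List, Sequence, Tuple


def hot_topics(utterance_scores: Sequence[Sequence[float]],
               topic_names: Sequence[str],
               top_k: int) -> List[Tuple[str, float]]:
    """Rank topics by their total score over all utterances.

    utterance_scores[u][k] is the score of topic k for utterance u.
    """
    totals = [sum(column) for column in zip(*utterance_scores)]
    ranked = sorted(zip(topic_names, totals), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Toy data: two utterances scored on three hypothetical topics.
scores = [[0.1, 0.8, 0.1],
          [0.2, 0.7, 0.1]]
selected = hot_topics(scores, ["biology", "illness", "outdoors"], top_k=2)
```
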

Although our study started with topic modeling to capture the
“what” in the chats, the association networks constructed in the 
epistemic network analysis actually turned the “what” into a 
“how”: how the topics in the chats associated with each other. 
This is conceptually similar to the cohesion features Dowell [7] 
and Cade [3] used. 


Topic modeling emphasizes content words. When a topic model is 
built, stop words are usually removed. An interesting question is, 
what if we do the opposite: keep stop words and remove content 
words? Pennebaker (e.g., [13]) laid foundational work in this di- 
rection. The LIWC tool Pennebaker and his colleagues created 
provides over a hundred text measures by counting non-content 
words. LIWC measures could provide different features to epis- 
temic network analysis and reveal different aspects of the chat 
dynamics. 

ABSTRACT


In this study, we applied decision trees (DT) to extract 
a compact set of pedagogical decision-making rules from 
an original full set of 3,702 Reinforcement Learning (RL)- 
induced rules, referred to as the DT-RL rules and Full-RL 
rules respectively. We then evaluated the effectiveness of 
the two rule sets against a baseline Random condition in 
which the tutor made random yet reasonable decisions. We 
explored two types of trees (weighted and unweighted) as 
well as two pruning strategies (pre- and post-pruning). We 
found that post-pruned weighted trees produced the best re- 
sults with 529 DT-RL rules. The empirical evaluation was 
conducted in a classroom study using an existing Intelligent 
Tutoring System (ITS) named Pyrenees. 153 students were 
randomly assigned to three conditions. The procedure was 
the same for all students with domain content and required 
steps strictly controlled. The only substantive difference
among the three conditions was the policy (Full-RL vs.
DT-RL vs. Random). Our results showed that, as expected,
the machine-induced policies (Full-RL and DT-RL) were
significantly more effective than the random policy; more
importantly, no significant difference was found between the
Full-RL and DT-RL policies, even though the number of DT-RL
rules is less than 15% of the number of Full-RL rules
and the former group also took significantly less time than
the latter.


1. INTRODUCTION 


Intelligent Tutoring Systems (ITSs) are interactive e-learning 
environments that support students’ learning by providing 
instruction, scaffolded practice, and on-demand help. The 
system’s behaviors can be viewed as a sequential decision- 
making process where at each step the system chooses an 
appropriate action from a set of options. Pedagogical strate- 
gies are the policies used to decide what action to take next 
in the face of alternatives. Each system decision will affect 
the user’s subsequent actions and performance. Its impact 
on outcomes cannot always be immediately observed and the 
effectiveness of each decision depends upon the effectiveness 
of subsequent actions. Ideally, an effective learning environment
will adapt its decisions to users’ specific needs [1, 11].
However, there is no existing well-established theory on how 
to make these system decisions effectively. Generally speak- 
ing, prior research on pedagogical policies can be divided 
into two general categories: top-down or theory-driven, and 
bottom-up or data-driven. 


In theory-driven approaches, ITSs employ hand-coded ped- 
agogical rules that seek to implement existing cognitive or 
learning theories [1, 10, 17]. While existing learning liter- 
ature gives helpful guidance on the design of pedagogical 
rules, such guidance is often too general to implement as 
effective immediate decisions. For example, the aptitude- 
treatment interaction (ATI) theory states that instructors 
should match their interventions to the aptitude of the learner 
[5]. While the principle behind this theory is understandable,
it is not clear how to implement that rule for each
decision: how should we represent a learner’s aptitude for each
equation, how exact should the system’s adaptation be, and
so on?


Data-driven approaches, on the other hand, derive peda- 
gogical policies directly from prior data. Here the policies 
specify the pedagogical decisions at a detailed level. Rein- 
forcement Learning (RL), which we use here, is one popular 
approach that is able to derive pedagogical policies directly 
from student-system interaction logs. These policies are de- 
fined as a set of state-action mapping rules, which give the 
best decision to take in each state. The states are typically 
represented as sets of features and the actions are pedagog- 
ical actions such as presenting a worked example (WE) or 
requiring the student to solve problems (PS). When the sys- 
tem presents a worked example, the students will be given a 
detailed example showing a complete expert solution for the 
problem or the best step to take given their current solution 
state. In Problem Solving, by contrast, students are tasked 
with solving a problem using the ITS or with completing an 
individual problem-solving step. 


For this project, our original complete RL-induced policy involves
the following seven features, which represent the students’
learning process from different perspectives. Each feature is given
in the format [Feature-Name] (Discretization Procedure):
Explanation of the feature.

1. [nWESincePS] (0 → 0; (0, 1] → 1; (1, +∞) → 2):
The number of worked example (WE) steps received
since the last problem solving (PS) step.


2. [timeInSession] ((0, 2290] → 0; (2290, 4775] → 1;
(4775, 7939] → 2; (7939, +∞) → 3): The total time
spent in the current session.


3. [avgTimeOnStepPS] ([0, 29.01] → 0; (29.01,
48.71] → 1; (48.71, +∞) → 2): The average amount of
time spent on each PS step.


4. [avgTimeOnStepSessionPS] ([0, 23.51] → 0;
(23.51, 36.56] → 1; (36.56, 55] → 2; (55, +∞) → 3):
The average amount of time spent on each PS step in
the current session.


5. [nStepSinceLastWrongKC] ((0, 1] → 0; (1, 7]
→ 1; (7, 25] → 2; (25, +∞) → 3): The number of steps
received since the last wrong PS step on the current
knowledge component (KC).


6. [nWEStepSinceLastWrong] ((0, 1] → 0; (1, 4]
→ 1; (4, 10] → 2; (10, +∞) → 3): The number of WE
steps since the last wrong PS step.


7. [nCorrectPSStepSinceLastWrongKCSession]
(0 → 0; (0, 3] → 1; (3, 10] → 2; (10, +∞) → 3): The
number of correct PS steps since the last wrong PS
step on the current KC in the current session.


With this feature set, a state can be represented as a 7- 
dimensional vector where each element denotes a discretized 
feature value. The rules can then be represented as:


(0:0:0:0:0:0:0) -> PS 
(0:0:0:0:0:0:1) -> PS 
(0:0:0:0:0:1:0) -> PS 
(0:0:0:0:0:1:1) -> WE 
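
As a rough illustration of how such a state vector and rule lookup could work, the sketch below discretizes raw feature values using the bin boundaries listed above and looks up the four example rules. The function names, the use of `bisect`, and the example raw values are assumptions for illustration only; this is not the authors' implementation.

```python
from bisect import bisect_left
from math import prod

# Inclusive upper bin edges for features 1-7, in the order listed above.
BIN_EDGES = [
    [0, 1],               # 1: WE steps since the last PS step
    [2290, 4775, 7939],   # 2: total time in the current session
    [29.01, 48.71],       # 3: average time per PS step
    [23.51, 36.56, 55],   # 4: average time per PS step in the current session
    [1, 7, 25],           # 5: steps since the last wrong PS step on the current KC
    [1, 4, 10],           # 6: WE steps since the last wrong PS step
    [0, 3, 10],           # 7: correct PS steps since the last wrong PS step (KC, session)
]

# The four example rules shown above, keyed by discretized state.
EXAMPLE_RULES = {
    (0, 0, 0, 0, 0, 0, 0): "PS",
    (0, 0, 0, 0, 0, 0, 1): "PS",
    (0, 0, 0, 0, 0, 1, 0): "PS",
    (0, 0, 0, 0, 0, 1, 1): "WE",
}


def discretize(raw_features):
    """Map seven raw feature values to a 7-dimensional discretized state."""
    # bisect_left returns the index of the first edge >= value, so each
    # interval is closed on the right, matching the (a, b] bins in the list.
    return tuple(bisect_left(edges, value)
                 for edges, value in zip(BIN_EDGES, raw_features))


def decide(raw_features, rules=EXAMPLE_RULES):
    """Return the action for the discretized state, or None if no rule exists."""
    return rules.get(discretize(raw_features))


# Number of distinct discretized states implied by the bins.
N_STATES = prod(len(edges) + 1 for edges in BIN_EDGES)

# Hypothetical raw values for one decision point.
state = discretize([0, 1200, 20.0, 15.0, 1, 3, 2])
action = decide([0, 1200, 20.0, 15.0, 1, 3, 2])
```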


In this study we discretized each feature into three or four values,
producing a seven-feature state. This results in a state
space of $3^2 \times 4^5 = 9216$ states, that is, 9216 rules in one RL-induced
policy. While these types of policies can specify the exact
action to take in each case, they are usually too narrow to 
be aligned to existing learning theories. Each of the rules 
covers only a very specific case and the relationship between 
rules is unknown. Thus it is impossible to explain the power 
of those rules from the perspective of learning theory. The 
opacity of those induced rules not only hinders us in improv- 
ing data-driven methodologies when they go wrong, it also 
prevents us from advancing learning science research more 
generally. Moreover, it is possible that some of the decisions 
are environment-specific and may not generalize to other 
contexts. This in turn prevents translating these induced 
policies to environments other than the one from which they 
are induced. Therefore, a general method is needed to shed 
some light on the extracted detailed data-driven policies. 


Decision tree (DT) induction is a robust data mining ap- 
proach which can be used to extract a compact set of rules 
from a set of specific examples. It builds a tree-like hierar- 
chical decision-making pattern which represents the knowl- 
edge it learned. Each path from root to leaf represents a 
single rule which may be dealt with separately. Prior studies
have shown that DTs can match training examples in
most cases, even with relatively small trees. Davidson et
al., for example, built a DT for predicting the extinction
risk of mammals [6]. Each species was described by
11 ecological features (e.g., body mass, geographic range, and
population density) and was labeled with its extinction
risk (threatened vs. non-threatened). Their tree contained
20 general rules which covered 4500 training examples, with
a decision accuracy over 80%. Additionally, Reinchard et al.
built a DT for predicting the invasiveness of woody plants
[13]. The resulting DT encoded 15 rules from 235 examples,
with a decision accuracy over 76%. Therefore, in our study,
we will apply DTs to extract general pedagogical decision-making
rules from the detailed RL-induced policies.


In short, our primary research question is: is DT an ef- 
fective methodology for extracting more general pedagogical 
rules from the detailed RL-induced pedagogical rules? In order
to investigate this question, we will build DTs using the
rules in an RL-induced policy as training examples and
empirically evaluate the effectiveness of the extracted set of DT
rules by comparing it to the full set of RL-induced rules in a 
classroom study. The state features in the RL-induced poli- 
cies are the input features for the DT and the pedagogical 
actions are the output labels. In our empirical evaluation, 
we separate the pedagogical decisions from the instructional 
content, strictly controlling the content so that it is equiva- 
lent for all participants by 1) using an ITS which provides 
equal support for all learners; and 2) focusing on tutorial 
decisions that cover the same domain content, in this case 
WE versus PS. 


2. BACKGROUND 


2.1 Applying RL to ITSs 


Beck et al. applied RL to induce pedagogical policies that 
would minimize the time students take to complete prob- 
lems on AnimalWatch, an ITS for grade school arithmetic 
[2]. They trained the model with simulated students. The 
low cost of generated data allowed them to apply a model-free
RL method, Temporal Difference learning. During the
test phase, the induced policies were added to AnimalWatch
and the new system was empirically compared with the original
system. Their results showed that the policy group
spent significantly less time per problem than their no-policy
peers. Note that their primary goal was to reduce the amount
of time per problem; however, faster problem-solving does
not always result in better learning performance. Nonetheless,
their results showed that RL can be successfully applied
to induce pedagogical policies for ITSs. 


Iglesias et al., on the other hand, focused on applying RL to 
improve the effectiveness of an Intelligent Educational System
that teaches students DataBase Design [8, 9]. They
applied another model-free RL algorithm, Q-learning, to induce
policies that provide students with direct navigation
support through the system’s content. They used simulated
students to induce the policy and empirically evaluated its
effectiveness on real students. Their results showed that 
while the policy led to more effective system usage behav- 
iors from students, the policy students did not outperform 
the no-policy peers in terms of learning outcomes. 


Shen et al. investigated the impact of both immediate and delayed
reward functions on RL-induced policies and empirically
evaluated the effectiveness of the induced policies within
an Intelligent Tutoring System called Deep Thought [15].
The induced pedagogical policies were used to decide whether
the next task should be WE or PS. They found that some
learners benefited significantly more from effective pedagogical
policies than others.


Finally, Chi et al. applied model-based RL to induce peda- 
gogical policies to improve the effectiveness of an Intelligent 
Natural Language Tutoring System for college-level physics 
called Cordillera [4]. The authors collected an exploratory 
corpus by training human students on an ITS that makes 
random decisions and then applied RL to induce pedagogi- 
cal policies from the corpus. They showed that the induced 
policies were significantly more effective than the prior ones. 


In short, prior studies have shown that RL-induced ped- 
agogical policies can improve students’ learning or reduce 
training time. However, all of these studies focused on the 
effectiveness of the RL-induced policies. None of them con- 
sidered extracting more general rules from the induced poli- 
cies. 


2.2 Extracting General Rules 

In addition to the work of Davidson et al. [6] and Reinchard 
et al. [13], DTs have been used for other tasks. Vayssiers 
et al., for example, applied Classification And Regression 
Trees to predict the presence of 3 species of oak in Califor- 
nia [18]. Their training examples were Vegetation Type Map 
records for 2085 unique locations. Each record consisted of 
25 climatic and geographic features as well as 3 labels show- 
ing the presence of the species (Quercus agrifolia, Quercus 
douglasii and Quercus lobata). One DT was induced for
each species. The DTs were tested on another dataset which
contained the same type of records for 2016 locations. For
Quercus agrifolia, the induced tree had 10 leaf nodes;
94.9% of its predictions were correct for the locations where
this oak is present (sensitivity), while 86.7% of
its predictions were correct for locations without the oak (specificity).
For Quercus douglasii, the induced tree had 22 leaf
nodes and a sensitivity and specificity of 87% and 79.9% 
respectively. For Quercus lobata, the tree had 6 leaves but 
reached a sensitivity of 77% and a specificity of 73.3%. 


Thus, prior studies have shown that DT can effectively ex- 
tract a small set of general decision-making rules from a 
large set of specific examples. However, all the examples 
used by these studies were observations of existing phenomena.
So far as we know, this is the only work that applies
DT to extract a compact set
of decision-making rules directly from full RL-induced rules
and empirically evaluates the two sets of rules.


2.3 Applying DT to RL


Prior research on incorporating DT with RL has largely
focused on seeking a better representation of the state space
or policy for RL. Boutilier et al. [3] proposed representational
and computational techniques for Markov Decision
Processes (MDPs) to reduce the size of the state space.
They used dynamic Bayesian networks and DTs to represent
stochastic actions, as well as DTs to represent rewards.
Based upon this representation, they then developed algorithms
to find conditional optimal policies. Their method
was empirically evaluated on several planning problems and
they showed significant savings in both time and space for
some types of problems. Gupta et al. proposed the Policy
Tree algorithm for RL. This algorithm is designed to directly
induce a functional representation of the conditional optimal
policies as a DT. They evaluated it on a variety of domains
and showed that it was able to make splits properly [7].


In short, prior researchers have shown that properly com- 
bining DT with RL can result in a large amount of savings 
in time and space for finding good policies. However, none 
of these studies directly applied DT on RL-induced policies. 


3. INDUCE FULL SET OF RL-POLICY 


Previously, researchers have typically used the Markov De- 
cision Process (MDP) [16] framework to model user-system 
interactions. The central idea behind this approach is to
transform the problem of inducing effective pedagogical poli- 
cies on what action the agent should take to the problem of 
computing an optimal policy for an MDP. 


3.1 Markov Decision Process 

An MDP is a mathematical framework for representing an
RL task. It is defined by a tuple $\langle S, A, T, R \rangle$, where $S =
\{S_1, S_2, \ldots, S_n\}$ denotes the state space; $A = \{A_1, A_2, \ldots, A_m\}$
represents the set of the agent’s possible actions; and $T: S \times A \times
S \to [0, 1]$ is a transition probability table, where each
element is $T_{S_i, S_j} = p(S_j \mid S_i, a)$. This indicates the
probability of transiting from state $S_i$ to state $S_j$ by taking
an action $a$, while $R: S \times A \times S \to \mathbb{R}$ assigns rewards
to state transitions given actions. The policy is defined as
$\pi: S \to A$, mapping each state in $S$ to an action in $A$ with the goal of
maximizing the expected reward.


After defining an MDP, we can transform the student-system
interaction dialog into a trajectory, which can be represented
as follows:

$$S_1 \xrightarrow{A_1, R_1} S_2 \xrightarrow{A_2, R_2} \cdots \xrightarrow{A_n, R_n} S_{n+1}$$

where $S_i \xrightarrow{A_i, R_i} S_{i+1}$ means that the tutor executed action
$A_i$ and received reward $R_i$ in state $S_i$, and then transferred
to the next state $S_{i+1}$. In general, rewards can be divided
into two categories, immediate and delayed: immediate
rewards are received during the state transition, while
delayed rewards are available only after reaching the goal state.


3.2 Training Datasets 

Our training dataset was collected from three exploratory
studies in which students were trained on an ITS which made
random yet reasonable pedagogical decisions. The studies
were given as homework assignments during CSC226: Discrete
Mathematics, a core CS course offered at NCSU during
the Fall 2014, Spring 2015 and Fall 2015 semesters. The
dataset contains a total of 149 students’ interaction logs.
All students used the same ITS, followed the same general
procedure, studied the same training materials, and worked
through the same training problems. In order to model the
students’ learning process, we extracted a total of 142 state
feature variables, which can be grouped into five categories:


1. Autonomy (AM): the amount of work done by the student,
such as the number of problems solved so far (PSCount)
or the number of hints requested (hintCount).

2. Temporal Situation (TS): time-related information
about the work process, such as the average time taken
per problem (avgTime) or the total time spent solving a
problem (TotalPSTime).

3. Problem Solving (PS): information about the current
problem solving context, such as the difficulty of the current
problem (probDiff) or whether the student changes the
difficulty level (NewLevel).

4. Performance (PM): information about the student’s
performance during problem solving, such as the number of
right applications of rules (RightApp).

5. Student Action (SA): statistical measurements of the
student’s behavior, such as the number of non-empty-click
actions that students take (actionCount) or the number of
clicks for derivation (AppCount).


3.3 Inducing RL Policies


In order to apply RL to induce pedagogical policies, we 
first defined the pedagogical decision-making problem as an 
MDP. The state representation includes all of the relevant 
features available at the beginning of each step. The ac- 
tions are WE and PS at the step level. The transition ta- 
bles were calculated on our training dataset, and our reward 
function includes two types of reward: delayed and immediate.
Our most important reward is based on normalized
learning gain (NLG), which measures the
students’ learning gains irrespective of their incoming competence.
This reward was given as a delayed reward, as NLG
scores can only be calculated after students finish the entire
training process. However, Shen et al. [15] showed that giving
immediate rewards can lead to the production of more
effective policies when compared to delayed rewards. This
is a consequence of the credit-assignment problem: the more
we delay success measures from a series of sequential decisions,
the more difficult it becomes to identify which of the
decision(s) in the sequence are responsible for the final success
or failure. Therefore, for the purposes of this study we
also assigned immediate rewards based upon the students’
performance during training on the system.


The value iteration algorithm was applied to find the optimal 
policy. This algorithm operates by finding the optimal value 
for each state V*(s). The optimal value for a given state is 
the expected discounted reward that the agent will gain if 
it starts in s and follows the optimal policy to the goal. 
Generally speaking, V*(s) can be obtained by the optimal 
value function for each state-action pair Q*(s,a) which is 
defined as the expected discounted reward the agent will 
gain if it takes an action a in a state s and follows the optimal
policy to the end. The optimal state value V*(s) and value 
function Q*(s,a) can be obtained by iteratively updating 
V(s) and Q(s,a) via equations 1 and 2 until they converge: 


$$Q(s,a) := R(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V(s') \qquad (1)$$

$$V(s) := \max_{a} Q(s,a) \qquad (2)$$

Here, $p(s' \mid s, a)$ is the estimated transition model $T$, $R(s,a)$
is the estimated reward model, and $0 < \gamma < 1$ is a discount
factor.
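
A minimal sketch of these two update equations is given below, assuming a small tabular MDP stored in dictionaries. The data layout, the function name, the convergence tolerance, and the toy two-state MDP are all illustrative assumptions rather than details taken from the study.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Tabular value iteration.

    P[s][a] is a dict mapping next state -> transition probability.
    R[s][a] is the estimated immediate reward for taking a in s.
    Returns (V, Q, policy).
    """
    V = {s: 0.0 for s in states}
    Q = {s: {a: 0.0 for a in actions} for s in states}
    for _ in range(max_iter):
        # Equation (1): Q(s,a) := R(s,a) + gamma * sum_s' p(s'|s,a) V(s')
        Q = {s: {a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                 for a in actions}
             for s in states}
        # Equation (2): V(s) := max_a Q(s,a)
        new_V = {s: max(Q[s].values()) for s in states}
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            break
    policy = {s: max(actions, key=lambda a: Q[s][a]) for s in states}
    return V, Q, policy


# Toy deterministic MDP with two states and the two tutor actions.
states = ["s0", "s1"]
actions = ["PS", "WE"]
P = {"s0": {"PS": {"s0": 1.0}, "WE": {"s1": 1.0}},
     "s1": {"PS": {"s1": 1.0}, "WE": {"s0": 1.0}}}
R = {"s0": {"PS": 0.0, "WE": 1.0},
     "s1": {"PS": 2.0, "WE": 0.0}}
V, Q, policy = value_iteration(states, actions, P, R, gamma=0.5)
```

The returned `Q` table is what a weighting scheme based on the difference between action values would draw on.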


To induce effective pedagogical policies, we combined RL
with various feature selection methods, including 10 types of
correlation-based methods and an ensemble method, and capped the
maximum number of state features at eight. More
details of our feature selection methods are described in [14].
The final resulting RL policy involves seven state features
and 3706 rules.


4. EXTRACTING COMPACT DT-RL SETS 


In order to extract a more compact set of decision-making 
rules from the full set of RL-induced rules, we implemented 
the ID3 algorithm to build DTs [12]. Each rule in the final 
RL-induced policy was used as a training example. Two
types of decision trees were built, unweighted and weighted,
and two pruning strategies were implemented,
pre- and post-pruning. Next, we discuss each of them
in turn.


4.1 Unweighted vs. Weighted Tree 


The decision to give a WE vs. PS may impact students’ 
learning differently in different situations. We therefore built 
two types of decision trees: unweighted and weighted. Unweighted
trees treat each decision equally, while weighted
trees take into account the relative importance of each pedagogical
rule. When applying the value iteration algorithm
to induce the optimal policy, we generate the optimal value
function Q*(s,a), which gives the expected discounted reward
the agent will gain if it takes an action a in a state s
and follows the optimal policy to the end. For a given state
s, a large difference between the values of Q(s, “PS”) and
Q(s, “WE”) indicates that it is more important for the ITS
to follow the optimal decision in the state s. We therefore
used the absolute difference between the Q values for each
state s to weight each RL pedagogical rule.


The ID3 algorithm builds a tree recursively from root to 
leaves. On each iteration of the construction process the 
algorithm will check the state of the dataset for the current 
branch. It will then select a test feature for the current 
node based upon the weighted information gain. The current 
node will then be expanded by adding branches to it, each 
of which represents a possible value for the selected feature. 
The data will be partitioned over the branches according to 
the value of the test feature. The selected feature cannot 
be used again by its children. Weighted information gain is
defined as the difference between the weighted entropy of the
examples in the node before the split and their weighted entropy
after they are separated by the value of the selected feature.
The weighted entropy of a node can be calculated by equation 3:


$$H(G) = -\sum_{i=1}^{J} p(i \mid G) \log_2 p(i \mid G) \qquad (3)$$

$J$ is the total number of output label classes. In our case,
it is the number of pedagogical actions (WE or PS), which
is 2. $p(i \mid G)$ is the weighted frequency defined by the equation:

$$p(i \mid G) = \frac{\sum_{x \in G,\ \mathrm{class}(x) = i} W_x}{\sum_{y \in G} W_y}$$

The numerator is the total weight of the
examples which are in node $G$ and which belong to class $i$,
and $\sum_{y \in G} W_y$ is the total weight of the examples in node $G$.


The information gain of splitting the current set of training
examples using feature $F$ can be calculated by equation 4:

$$IG(F, G) = H(G) - \sum_{j} p(t_j \mid G)\, H(t_j) \qquad (4)$$

$p(t_j \mid G)$ is the weighted frequency of the examples in node $G$
whose value of feature $F$ is $t_j$, and $H(t_j)$ is the weighted
entropy of those examples:

$$p(t_j \mid G) = \frac{\sum_{x \in G,\ x_F = t_j} W_x}{\sum_{y \in G} W_y}$$

The numerator is the total weight
of the examples in node $G$ whose value of feature $F$ is $t_j$, and
$\sum_{y \in G} W_y$ is the total weight of the examples in node $G$.
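
The weighted entropy and weighted information gain above can be sketched as follows. This is an illustrative reading of equations 3 and 4, assuming each training example is a `(state, action, weight)` triple whose weight is the absolute difference between the two Q values; the function names and toy examples are hypothetical.

```python
import math
from collections import defaultdict


def rule_weight(q_ps, q_we):
    """Weight of a rule: absolute difference between Q(s, PS) and Q(s, WE)."""
    return abs(q_ps - q_we)


def weighted_entropy(examples):
    """Equation (3): entropy over classes using weight shares instead of counts."""
    total = sum(w for _, _, w in examples)
    if total == 0:
        return 0.0
    class_weight = defaultdict(float)
    for _, label, w in examples:
        class_weight[label] += w
    return -sum((cw / total) * math.log2(cw / total)
                for cw in class_weight.values() if cw > 0)


def weighted_information_gain(examples, feature_index):
    """Equation (4): entropy before the split minus weighted entropy after it."""
    total = sum(w for _, _, w in examples)
    if total == 0:
        return 0.0
    partitions = defaultdict(list)
    for example in examples:
        partitions[example[0][feature_index]].append(example)
    remainder = sum((sum(w for _, _, w in part) / total) * weighted_entropy(part)
                    for part in partitions.values())
    return weighted_entropy(examples) - remainder


# Toy rules over a two-feature state; weights come from hypothetical Q values.
examples = [
    ((0, 0), "PS", rule_weight(5.0, 2.0)),   # weight 3.0
    ((0, 1), "PS", rule_weight(4.0, 3.0)),   # weight 1.0
    ((1, 0), "WE", rule_weight(1.0, 4.0)),   # weight 3.0
    ((1, 1), "WE", rule_weight(2.0, 3.0)),   # weight 1.0
]
gain_feature_0 = weighted_information_gain(examples, 0)
gain_feature_1 = weighted_information_gain(examples, 1)
```

In this toy set, feature 0 separates the two actions perfectly and feature 1 carries no information, so ID3 would split on feature 0 first.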


4.2 Pre-Pruning and Post-Pruning 

To control the number of rules induced by DT, we examined
two types of pruning strategy: pre- and post-pruning.
Pre-pruning is conducted while building the
tree and uses the information gain to determine whether
to expand a node or to terminate. Only nodes with an information
gain greater than a threshold times their depth, $IG(F, G) >
\theta \times D_G$, are expanded; the others become leaves.
$\theta$ is a fixed threshold and $D_G$ is the depth of node $G$.


Post-pruning is conducted after the whole decision tree is
built and uses the error rate as the pruning measure. The
error rate before a node is expanded is defined as:

$$e_G = \frac{\sum_{i \in I} W_i}{|G|}$$

$I$ is the set of the decisions incorrectly classified
by node $G$ and $|G|$ is the total number of examples in the
node $G$. The error rate after a node is expanded is defined
as:

$$e_C = \frac{\sum_{c \in C} \sum_{i \in I_c} W_i}{|G|}$$

$C$ is the set of children nodes
of $G$ after it is expanded and $I_c$ is the set of the decisions
incorrectly classified by the node $c$. In post-pruning, if the
reduction in a node’s error rate from before to after the split is
less than a threshold, the node is pruned by removing
all of its branches to make it a leaf node.
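
The two pruning tests can be sketched as simple predicates. This is only an illustration: the function names and example numbers are hypothetical, and normalizing both error rates by the number of examples in the node is an assumption about how the two rates are made comparable.

```python
def should_expand(info_gain, depth, threshold):
    """Pre-pruning: expand a node only if its gain exceeds threshold * depth."""
    return info_gain > threshold * depth


def error_rate(misclassified_weights, n_examples):
    """Total weight of misclassified decisions divided by the node size."""
    return sum(misclassified_weights) / n_examples


def should_prune(error_before, error_after, threshold):
    """Post-pruning: collapse the node if splitting reduces error by too little."""
    return (error_before - error_after) < threshold


# A node at depth 2 with gain 0.3 is expanded under threshold 0.1,
# but the same gain at depth 4 is not.
expand_shallow = should_expand(0.3, depth=2, threshold=0.1)
expand_deep = should_expand(0.3, depth=4, threshold=0.1)

# Error before the split, and after summing the children's misclassified weights.
before = error_rate([0.5, 1.5], n_examples=10)
after = error_rate([0.5, 1.0], n_examples=10)
prune = should_prune(before, after, threshold=0.1)
```

Because the pre-pruning bar rises with depth, deeper nodes need a larger information gain to justify another split.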


4.3 The Compact Set of DT-RL Rules

In order to induce a compact set of DT-RL rules, we applied
DT induction to the full set of 3706 RL-induced rules. The
induced unweighted and weighted DTs without pruning have
2527 and 2456 rules (leaf nodes) respectively. Thus, even without
pruning, DTs are already able to extract a smaller set
of rules: they reduce the total number of rules by over 1000.


Figure 1 shows the relationship between the number of leaf
nodes (x-axis) and the inverted weighted accuracy (y-axis).
Weighted accuracy (WA) is the weighted percentage of decisions
correctly made, which can be calculated by the equation
$WA = \frac{\sum_{i \in T} w_i}{\sum_{j} w_j}$, where $T$ is the
set of correct predictions made by a DT, $w_i$ is the weight of
decision $i$, and the denominator sums over all decisions. The
inverted weighted accuracy (IWA) is $IWA = WA^{-1}$; the
lower the better. Since our goal is to find a good balance
point between the IWA and the number of leaf nodes, we
applied a widely used strategy called the Elbow Method
to select the best tree. As we can see in the figure, the
elbows for the two unweighted tree approaches are around
800 and 1700 rules (x-axis) for pre- and post-pruning
respectively, while the elbows for the two weighted tree
approaches are around 250 and 500 for pre- and post-pruning
respectively. So it seems that the weighted trees can extract
a more compact set of rules than the unweighted trees. While
the weighted pre-pruning approach has around 250 rules,
its IWA is much higher than that of the weighted post-pruning
approach. Therefore, we chose the weighted tree with the
post-pruning strategy, which has an elbow at about 500 leaf
nodes and a reasonable IWA.
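As a small worked illustration of the two measures, the sketch below computes WA as the weight of the correctly predicted decisions divided by the total weight, and IWA as its reciprocal. The function names and the toy weights are assumptions for the example; they are not taken from the paper.

```python
def weighted_accuracy(weights, is_correct):
    """WA: total weight of correctly predicted decisions / total weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    correct = sum(w for w, ok in zip(weights, is_correct) if ok)
    return correct / total


def inverted_weighted_accuracy(weights, is_correct):
    """IWA = 1 / WA (lower is better)."""
    wa = weighted_accuracy(weights, is_correct)
    if wa == 0:
        raise ValueError("IWA is undefined when no decision is correct")
    return 1.0 / wa


# Toy example: three decisions, the most important one is predicted correctly.
weights = [3.0, 1.0, 1.0]
is_correct = [True, False, True]
print(weighted_accuracy(weights, is_correct))           # 4 / 5
print(inverted_weighted_accuracy(weights, is_correct))  # 5 / 4
```

With unit weights the same function reduces to plain accuracy, which corresponds to the unweighted trees.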


To further justify our DT choice, Table 1 shows the relationship
between the pruning thresholds, WA and the number
of leaf nodes for the weighted tree with post-pruning. Table 1
shows that the tree whose number of leaves is closest
to 500 is the one with 529 leaves. It can be obtained by applying
a pruning threshold of 0.8, and the resulting tree has a weighted
accuracy of 0.76. The rules in the resulting tree are the rules used
in the DT-RL condition.


In short, we applied DTs to the RL-induced pedagogical policies
to extract a more compact set of decision-making rules. The
effectiveness of the original full set and the compact set of
policies was empirically compared against a baseline policy
which makes random yet reasonable decisions: PS vs. WE.
Thus, we have three conditions:


1. Full-RL: the full set of 3706 RL-induced rules. 
2. DT-RL: the compact set of 529 DT-induced RL rules. 
3. Random: the random yet reasonable policy. 


5. EMPIRICAL EXPERIMENT 


Participants: This study was conducted in the undergraduate
Discrete Mathematics course at the Department
of Computer Science at NC State University in the Fall of
2016. 153 students participated in this study, which was
given as their final homework assignment.


Conditions: Students in the study were assigned to three 
conditions via balanced random assignment based upon their 
course section and performance on the class mid-term exam. 
Since the primary goal of this work is to examine the ef- 
fectiveness of the two RL based policies, we assigned more 
students to the Full-RL and DT-RL conditions than in the 
random condition. The final group sizes were: N = 61 (Full- 
RL), N = 51 (DT-RL), and N = 41 (Random). 


Due to preparations for exams and the length of the experiment,
only 126 students completed the experiment. Five students were
excluded from the subsequent analysis due to perfect pretest
scores, working in groups, or gaming the system during the
training. The remaining 121 students were distributed as
follows: N = 45 for Full-RL; N = 41 for DT-RL; N = 35
for Random. We performed a χ² test of the relationship
between students’ condition and their rate of completion
and found no significant difference among the conditions:
χ²(2) = 0.955, p = 0.620.
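The sketch below illustrates how such a χ² test of independence can be computed. Two caveats: the per-condition counts of the 126 completers are not reported, so the table uses the post-exclusion group sizes (45, 41, 35 out of 61, 51, 41) as stand-ins and is not expected to reproduce the reported statistic; and the function names are hypothetical. The closed-form p-value relies on the fact that a χ² distribution with 2 degrees of freedom has survival function exp(-x/2).

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


def p_value_dof2(stat):
    """Upper-tail p-value, valid only for 2 degrees of freedom."""
    return math.exp(-stat / 2.0)


# The reported statistic and p-value are mutually consistent:
print(round(p_value_dof2(0.955), 3))

# Stand-in table (retained vs. not retained per condition); see caveat above.
stand_in = [[45, 61 - 45],   # Full-RL
            [41, 51 - 41],   # DT-RL
            [35, 41 - 35]]   # Random
stat, dof = chi_square_independence(stand_in)
print(dof, round(stat, 2), round(p_value_dof2(stat), 3))
```

Even with the stand-in counts the test does not reach significance at the 0.05 level.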


Probability Tutor: Pyrenees is a web-based ITS for prob- 
ability. It covers 10 major principles of probability, such 
as the Complement Theorem and Bayes’ Rule. Pyrenees
provides step-by-step instruction and immediate feedback.
Pyrenees can also provide on-demand hints prompting the 
student with what they should do next. As with other sys- 
tems, help in Pyrenees is provided via a sequence of in- 
creasingly specific hints. The last hint in the sequence, the 
bottom-out hint, tells the student exactly what to do. For 
the purposes of this study we incorporated three distinct 
pedagogical decision modes into Pyrenees to match the three 
conditions. 


Procedure: In this experiment, students were required to 
complete 4 phases: 1) pre-training, 2) pre-test, 3) training on 
Pyrenees, and 4) post-test. During the pre-training phase, 
all students studied the domain principles through a proba- 
bility textbook, reviewed some examples, and solved certain 
training problems. The students then took a pre-test which 
contained 14 problems. The textbook was not available at 
this phase and students were not given feedback on their an- 
swers, nor were they allowed to go back to earlier questions. 
This was also true of the post-test. 


During phase 3, students in all three conditions received 
the same 12 rather complicated problems in the same order 
on Pyrenees. Each main domain principle was applied at 
least twice. The minimal number of steps needed to solve 
each training problem ranged from 20 to 50. These steps 
included defining variables, applying principles, and solv- 
ing equations. The number of domain principles required to 
solve each problem ranged from 3 to 11. All of the students 
could access the corresponding pre-training textbook dur- 
ing this phase. Each step in the problems could have been 
provided as either a WE or PS based upon the condition 
policy. Finally, all of the students completed a post-test 
with 20 problems. 14 of the problems were isomorphic to 
the pre-test given in phase 2. The remaining six were
non-isomorphic complicated problems.


Grading Criteria: The test problems required students to 
derive an answer by writing and solving one or more equa- 
tions. We used three scoring rubrics: binary, partial credit, 
and one-point-per-principle. Under the binary rubric, a so- 
lution was worth 1 point if it was completely correct or 0 
if not. Under the partial credit rubric, each problem score 
was defined by the proportion of correct principle applica- 
tions evident in the solution. A student who correctly ap- 
plied 4 of 5 possible principles would get a score of 0.8. The 
one-point-per-principle rubric in turn gave a point for each 
correct principle application. All of the tests were graded in 
a double-blind manner by a single experienced grader. The 
results presented below are based upon the partial-credit 
rubric but the same results hold for the other two. For 
comparison purposes, all test scores were normalized to the 
range of [0,1]. 
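The three rubrics can be illustrated with a short sketch. It assumes that a solution is summarized by the number of correctly applied principles out of the number of principles the problem requires, and that "completely correct" means all principles were applied correctly; the function names and the normalization by the maximum attainable score are assumptions for the example.

```python
def binary_score(n_correct, n_total):
    """1 point only if every required principle application is correct."""
    return 1.0 if n_correct == n_total else 0.0


def partial_credit_score(n_correct, n_total):
    """Proportion of correct principle applications in the solution."""
    return n_correct / n_total


def per_principle_score(n_correct):
    """One point for each correct principle application."""
    return float(n_correct)


def normalize(raw_total, max_total):
    """Rescale a raw test total to the range [0, 1]."""
    return raw_total / max_total


# The example from the text: 4 of 5 principles applied correctly.
print(binary_score(4, 5))           # not completely correct
print(partial_credit_score(4, 5))   # 4 / 5
print(per_principle_score(4))
```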
6. EMPIRICAL RESULTS

Since both the Full-RL and DT-RL policies are based on an
RL-induced policy, we combined the two conditions together
as the Induced group to evaluate the effectiveness of the
RL-induced policy. The evaluation was conducted by comparing
the Induced group with the baseline Random condition on 
learning performance and training time. Moreover, in or- 
der to further discover to what extent the compact policy 
retained the power of the full policy, we compared the Full- 
RL and DT-RL conditions on the same measures. Next, we 
will discuss each of the comparisons in turn. 


6.1 Induced vs. Random 

We measured students’ incoming competence via the pre-test
scores collected before training took place. Table 2
shows a comparison between the Induced group and the
Random group in terms of learning performance. The
parenthesized values following the group names in row 1 denote
the number of students in each group. The second row in this
table shows the pre-test scores. The last column shows the
pairwise t-test results. Pairwise t-tests on students’ pre-test
scores show that there is no significant difference between
the two groups: t(119) = -0.346, p = 0.730, d = 0.069.
Thus, despite attrition, the two groups remained balanced
in terms of incoming competence. Next, we will compare the
two groups in terms of learning performance in the post-test
and training time.
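The reported effect size and t statistic can be checked from the summary statistics alone. The sketch below assumes an independent-samples t-test with pooled variance (consistent with 86 + 35 - 2 = 119 degrees of freedom) and Cohen's d based on the pooled SD, and uses the pre-test means and SDs from Table 2: .686 (.194) for Induced (N = 86) and .699 (.171) for Random (N = 35). The function names are hypothetical, and small deviations from the reported values are expected because the inputs are rounded.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Signed standardized mean difference (group 1 minus group 2)."""
    return (m1 - m2) / pooled_sd(sd1, n1, sd2, n2)


def t_statistic(m1, sd1, n1, m2, sd2, n2):
    """Student's t for two independent groups with pooled variance."""
    return cohens_d(m1, sd1, n1, m2, sd2, n2) / math.sqrt(1 / n1 + 1 / n2)


# Pre-test row of Table 2: Induced vs. Random.
d = cohens_d(0.686, 0.194, 86, 0.699, 0.171, 35)
t = t_statistic(0.686, 0.194, 86, 0.699, 0.171, 35)
print(round(abs(d), 3), round(t, 3))
```

The paper reports d as a magnitude, so the absolute value is the quantity to compare against the table.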


Rows 2-4 in Table 2 show a comparison of the pre-test,
isomorphic post-test (14 isomorphic questions), and adjusted
post-test scores between the two groups along with the mean
and SD for each. In order to examine the students’
improvement through training on Pyrenees, we compared their
scores on the pre-test and isomorphic post-test questions.
A repeated measures analysis using test type (pre-test and
isomorphic post-test) as the factor and test score as the
dependent measure showed a main effect for test type:
F(1,119) = 98.75, p < 0.0001. Further comparisons on a
group-by-group basis showed that on the isomorphic questions,
both groups scored significantly higher in the post-test than in
the pre-test: F(1,85) = 81.30, p < 0.0001 for Induced and
F(1,34) = 18.30, p = 0.0001 for Random. This suggests
that the basic practice and problems, domain exposure, and
interactivity of our ITS might help students to learn even
when pedagogical decisions are made randomly.


In order to investigate the effectiveness of the induced
policies, we compared students’ overall learning performance,
which was evaluated by their adjusted post-test scores,
between the two groups. A one-way ANCOVA was
conducted on their overall post-test scores (20 questions),
using the pretest scores as a covariate to factor out the
influence of their incoming competence. The result shows a
significant main effect: F(1,118) = 4.628, p = 0.033. That
is, the Induced group significantly outperformed the Random
group on adjusted post-test scores, as shown in
the fourth row of Table 2. Therefore, the results showed that
the induced policies are significantly more effective than the
random policy.


Table 2: Induced vs. Random

                | Induced (86)  | Random (35)   | T-test Result
Pre-test        | .686(.194)    | .699(.171)    | t(119) = -0.346, p = 0.730, d = 0.069
Isomorphic Post | .851(.155)    | .812(.195)    | t(119) = 1.141, p = 0.256, d = 0.229
Adjusted Post   | .751(.144)    | .689(.138)    | t(119) = 2.162, p = 0.033, d = 0.433
Time (min)      | 105.87(34.30) | 111.18(27.33) | t(119) = -0.815, p = 0.417, d = 0.163
WE steps        | 205.74(62.73) | 189.46(11.39) | t(119) = 1.522, p = 0.131, d = 0.305
PS steps        | 173.69(61.14) | 190.26(10.28) | t(119) = -1.591, p = 0.114, d = 0.319
WE %            | 54.16(16.35)  | 49.89(2.78)   | t(119) = 1.532, p = 0.128, d = 0.307


The fifth row in Table 2 shows the average amount of total
training time (in minutes) students spent on our ITS for each
group. A pairwise t-test showed no significant difference in
training time between the two groups: t(119) = -0.815, p =
0.417, d = 0.163. The results suggest that, compared
to the random policy, the induced policies generally do not
have a significantly different impact on students’ training time.


The last three rows in Table 2 show the number of WE 
and PS steps given as well as the percentage of WE steps 
received by the Induced and the Random group. Pairwise 
t-tests showed that there is no significant difference between 
the two groups on these three measures. 


6.2 Full-RL vs. DT-RL


We then performed the same comparison between the Full- 
RL and DT-RL conditions in order to examine the effective- 
ness of the DT-extracted compact policy. The second row 
in Table 3 shows the pre-test scores for each condition. A 
pairwise t-test on the scores shows no significant difference 
between the two conditions: t(84) = -0.168, p = 0.867,
d = 0.036. Thus the two conditions were balanced in terms 
of incoming competence. 


The pre-test, isomorphic post-test and adjusted post-test
scores are shown in rows 2-4 of Table 3. A repeated measures
analysis using test type (pre-test and isomorphic post-test)
as the factor and test score as the dependent measure showed
a main effect for test type: F(1,85) = 81.30, p < 0.0001.
Further comparisons on a group-by-group basis showed that
both conditions scored significantly higher in the isomorphic
post-test than in the pre-test: F(1,44) = 42.16, p < 0.0001
for Full-RL and F(1,40) = 39.16, p < 0.0001 for DT-RL.
These results suggest that the students can effectively learn
from Pyrenees with both the full and the compact policies.


In order to discover to what degree the compact policy
retained the effectiveness of the full policy, we compared the
post-test scores between the two conditions. The results
of a pairwise t-test showed no significant difference between
them on the isomorphic post-test: t(84) = 0.505, p = 0.615,
d = 0.109. We also conducted an ANCOVA on the
overall post-test scores using the pretest scores as a covariate
and still found no significant difference between the two
conditions: F(1,83) = 0.348, p = 0.557. In short, while the
DT-RL condition scored slightly lower than the Full-RL
condition on the post-test, the difference is not significant.

The fifth row of Table 3 shows the average amount of time
students spent on training. As the row shows, the Full-RL
condition spent significantly more time than the DT-RL
condition: t(84) = 3.829, p = 0.0002, d = 0.827. Thus
the Full-RL and DT-RL policies have significantly different
impacts upon the students’ training time.


The last three rows of Table 3 show the number of WE
and PS steps given and the percentage of WE steps
received by the Full-RL and the DT-RL conditions. Pairwise
t-tests showed that, compared to the DT-RL condition,
the Full-RL condition received significantly fewer WE
steps: t(84) = -4.952, p < 0.0001, d = 1.069; received a
lower percentage of WE steps: t(84) = -4.955, p < 0.0001,
d = 1.070; and completed more PS steps: t(84) = 4.999,
p < 0.0001, d = 1.079. These results suggest that the
pedagogical decisions made by the compact and full policies are
substantively different.


7. DISCUSSION 


In this study, we applied DTs to extract a compact set of
pedagogical rules from the full set of RL-induced rules and
empirically evaluated the effectiveness of the two sets of rules in
a classroom study. Our goal was to shed some light on the
RL-induced policies, and we think this is only the first step
towards narrowing the gap and building a bridge between
machine-induced pedagogical policies and learning theories.


In order to find the best DT, we explored two types of tree,
unweighted and weighted, and for each of them we applied
two pruning strategies: pre- and post-pruning.
After comparing their performance, we selected
the weighted tree with the post-pruning strategy to perform
the extraction of general decision-making rules. The
RL-induced policy contains 3706 specific rules, and the compact
DT-RL policy consists of 529 rules with a weighted decision
accuracy of 76%.


In our empirical experiment, we were able to strictly control
the domain content and thus to isolate the impact of
pedagogy from content. Based on this isolation, we compared
students’ performance with the Full-RL policy, the DT-RL
policy and the baseline random policy. Our results showed
that students in all three conditions learned significantly
after training on Pyrenees; this suggests that the basic training
of the ITS is effective, even when the pedagogical decisions
are made randomly. To evaluate the effectiveness of the two
machine-induced policies (Full-RL policy and DT-RL policy),
we combined the Full-RL and DT-RL conditions as the
Induced group and compared its learning performance with
the Random group. Our results showed that the Induced
group significantly outperformed the Random group. These
results suggest that the machine-induced policies are indeed
more effective than the random policy.


Table 3: Full-RL vs. DT-RL

                | Full-RL (45)  | DT-RL (41)    | T-test Result
Pre-test        | .683(.205)    | .690(.184)    | t(84) = -0.168, p = 0.867, d = 0.036
Isomorphic Post | .859(.145)    | .842(.168)    | t(84) = 0.505, p = 0.615, d = 0.109
Adjusted Post   | …             | …             | …, p = 0.554, d = 0.128
Time (min)      | 118.42(35.00) | 92.10(…)      | t(84) = 3.829, p = 0.0002, d = 0.827
WE steps        | 177.44(48.86) | 236.80(62.03) | t(84) = -4.952, p < 0.0001, d = 1.069
PS steps        | 201.47(47.22) | 143.20(60.57) | t(84) = 4.999, p < 0.0001, d = 1.079
WE %            | 46.77(12.78)  | 62.26(16.13)  | t(84) = -4.955, p < 0.0001, d = 1.070

(… marks values that are illegible in the source; of the two
Adjusted Post means only .739(.145) is legible.)


Finally, in order to examine to what extent the compact
DT-RL policy retained the power of the full RL-induced policy,
we compared the learning performance of the Full-RL and
the DT-RL conditions. Our results suggest that while some
of the power was lost in the extraction of general rules, the
performance difference between the Full-RL and the
DT-RL conditions is not significant. In addition, our results
on the pedagogical decisions made in training revealed that
the compact DT-RL policy selected significantly more WE steps
than the Full-RL policy. This suggests that the two sets
of policies indeed made materially different decisions.
However, since the weighted DT took account of the importance
of each rule, the DT-RL policy aims to retain maximal
decision effectiveness from the Full-RL policy while the size of
the former is less than 15% of the size of the Full-RL rule set
(529 of 3706 rules). In the future, we will apply existing learning
theories to the decision-making process generated by the decision
tree to find a theoretical basis for the DT-induced general
pedagogical decision-making rules.


ABSTRACT


In this paper, we investigate the relationship between students’
learning gains and their compliance with prompts fostering
self-regulated learning (SRL) during interaction with MetaTutor, a
hypermedia-based intelligent tutoring system (ITS). When possible,
we evaluate compliance from students’ explicit answers on
whether they want to follow the prompts. When such answers are
not available, we mine several student behaviors related to prompt
compliance. These behaviors are derived from students’
eye-tracking and interaction data (e.g., time spent on a learning page,
number of gaze fixations on that page). Our results reveal that
compliance with some, but not all, SRL prompts provided by
MetaTutor does influence learning. These results contribute to
a better understanding of how students benefit from SRL prompts,
and provide insights on how to further improve their
effectiveness. For instance, prompts that do improve learning when
followed could be the focus of adaptation designed to foster
compliance for those students who would disregard them otherwise.
Conversely, prompts that do not improve learning when followed
could be improved based on further investigations to understand
the reason for their lack of effectiveness.


Keywords 


Intelligent tutoring systems; Self-regulated learning; Scaffolding; 
Compliance with prompts; Learning gains; Eye tracking; Linear 
regression; Hypermedia 


1. INTRODUCTION 


There is extensive evidence that the effectiveness of Intelligent 
Tutoring Systems (ITS) is influenced by how well students can 
regulate their learning, e.g., [13, 22]. Current research has shown 
that scaffolding self-regulated learning (SRL) strategies such as 
setting learning goals or assessing progress through the learning 
content can improve learning outcomes with an ITS, e.g., [1, 10, 
22]. In particular, one of the most common approaches to scaffold 
SRL is to deliver prompts designed to guide students in applying 
specific SRL strategies as needed [22]. Previous work has focused
on assessing the general effectiveness of such SRL prompts, for
instance by comparing learning outcomes of students working
with versions of the same ITS with and without the prompts (e.g.,
[1, 19, 21]). Other work has investigated the extent to which
students comply with the overall set of prompts generated by an
ITS [16, 21]. However, there has been no reported study on the
relationship between compliance with specific SRL prompts and
learning outcomes. In this paper, we aim to fill this gap.
Specifically, we explore the impact of student compliance with SRL
prompts on learning gains with MetaTutor, an ITS designed to 
scaffold student SRL processes while learning about topics of the 
human circulatory system [1]. 


Our results show that student learning is influenced by compli- 
ance with some, but not all, of the SRL prompts delivered by 
MetaTutor. Overall, we found a positive impact on learning for 
compliance with prompts fostering learning strategies (revising a 
summary, reviewing notes), or planning processes (setting new 
learning goals). On the other hand, we found no impact on learn- 
ing with prompts related to metacognitive monitoring processes 
(e.g., prompts to stay on or move away from the current page 
depending on student performance on a quiz on that page).
Having information on the efficacy of each specific prompt in an ITS is
important to guide further research on how to improve prompts
that do not seem to improve learning when students follow them.
Furthermore, prompts that foster learning when followed can
become the focus of adaptive interventions designed to improve
compliance for those students who would disregard these prompts
if left to their own devices.


The paper also provides initial insights into prompt design issues
that affect how easy it is to evaluate compliance. In MetaTutor,
some prompts explicitly asked students whether they wanted to
follow the prompt, and then provided a suitable affordance to
accommodate a positive reply. Compliance with these prompts is
easy to assess, but the additional interactions that they require
might not always be possible, or might even be intrusive for some
students. Other prompts did not require any specific response
from the students. Thus, such prompts are in less danger of being
intrusive, and provide for a more open-ended interaction. On the
other hand, assessing compliance with these prompts is not trivial,
because there is no clear definition of what compliance means.
For example, one of the MetaTutor prompts asks students to
re-read the current MetaTutor content page, but there is no obvious
way to map this rather generic suggestion to a specific desired
behavior (e.g., spend a specific amount of time on the page, read a
specific number of words). We addressed this problem by running
linear models to correlate a variety of student behaviors related to
prompt compliance with learning. The behaviors we mined are
based on both action and eye-tracking data (e.g., time spent on
that page, gaze fixations on the content of the page), and our
results provide initial evidence that combining these two data
sources can help to evaluate compliance. Thus, our findings
represent a step toward research on how to evaluate compliance with
prompts, both for the type of offline analysis presented in this
paper and for the real-time detection of compliance
necessary if we want to have ITSs that adaptively help students follow
prompts as needed.


Figure 1. Screenshot of MetaTutor.
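To make the idea of relating a compliance behavior to learning concrete, the sketch below fits a one-predictor least-squares line. It is only an illustration: the models used in the paper may include further predictors, the function name is hypothetical, and the numbers are made up for the example and are not study data.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustration: seconds spent on the page after a "re-read" prompt
# versus a learning-gain score. Not data from the study.
time_on_page = [10.0, 20.0, 30.0, 40.0]
learning_gain = [0.15, 0.20, 0.25, 0.30]

slope, intercept = fit_line(time_on_page, learning_gain)
print(slope, intercept)
```

A positive slope for a given behavior would suggest that more of that behavior after the prompt goes along with higher learning gains; an eye-tracking measure such as the number of gaze fixations could be used as the predictor in the same way.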


The remainder of the paper starts with an overview of related 
work, followed by a description of MetaTutor and the study that 
generated the dataset we used for this research. Next, we illustrate 
how we mined data to evaluate compliance with MetaTutor’s 
prompts, the statistical analysis we conducted, and our results. 


2. RELATED WORK 


There has been extensive work on assessing the effectiveness of 
scaffolding designed to support learning with ITSs. Scaffolding 
can include prompts or hints (i.e., interventions that guide the
student in the right direction), feedback (evaluation of students’
answers, behavior or strategies), or demonstration (e.g., worked
examples showing expert behavior) [22, 23]. Such scaffolding can
be domain-specific to support the acquisition of domain-specific
knowledge, or can target domain-independent, meta-cognitive
learning processes such as processes for self-regulated learning
(SRL). There is extensive evidence that both domain-specific 
scaffolding (e.g., [3, 12, 18, 20]) and meta-cognitive scaffolding 
(e.g., [2, 10, 11, 21]) can improve the effectiveness of ITS. For 
example, domain-specific hints that explain how to solve the 
current problem step have been shown to improve skill acquisi- 
tion in a variety of domains such as mathematics [20] and reading 
[3, 12]. At the meta-cognitive level, Roll et al. [21] tracked 
suboptimal help-seeking patterns (e.g., overuse of help) to deliver 
prompts and feedback on how to effectively use help. Prompts
and feedback designed to help construct self-explanations during
reading [10] or solving scientific problems [11] have been found
to positively influence learning. Azevedo et al. [2] showed that
SRL prompts and feedback effectively foster efficient use of SRL 
strategies while learning about biology. 


Research has also examined student compliance with SRL 
prompts in ITS [5, 16]. Kardan and Conati [16] examined the 
benefit of providing a variety of prompts designed to help stu- 
dents progress within an interactive learning simulation. Overall 
they found that students largely complied with the prompts and 
that providing these prompts improved learning gains. However, 
they did not explore whether and how compliance with specific
prompts influences learning outcomes, or which prompts are the
most effective. Bouchet et al. [5] adapted the frequency of prompt
delivery in MetaTutor based on whether students previously
complied with prompts of the same type. However, their analysis
uncovered no influence of this adaptive prompting strategy on
learning gains. We extend the aforementioned work on prompt
compliance by showing how learning gains are impacted by
compliance with some, but not all, SRL prompts in MetaTutor.
Furthermore, whereas previous work solely used interaction data to
evaluate compliance, we also leverage eye-tracking data when
compliance cannot be inferred directly from students’ answers or actions
(e.g., compliance with prompts to read a text further).


Eye-tracking has been used in ITSs to model a variety of students’
traits and behaviors, e.g., emotions [14], learning outcomes [15],
metacognitive behavior [7], or mind wandering [4]. Eye tracking
has also been used to capture students’ attention to prompts [6, 8]
and to pedagogical agents [17]. Conati et al. [6] leveraged gaze
data to detect whether students processed domain-specific textual 
prompts in an educational game for math, and found that reading 
the prompts more extensively improved game performance. Lallé 
et al. [17] used gaze data to capture student visual attention to 
pedagogical agents in MetaTutor, and found that student learning 
gains are significantly influenced by specific metrics for visual 
attention (fixation rate, longest fixation). Eye-tracking has also
been used to add real-time adaptive prompts to Guru, an agent-based
ITS for learning biology [9]. In that work, audible prompts
designed to reorient student attention towards the screen were 
triggered if a student had not looked at the screen for more than 5s 
while Guru was providing scaffolding. This research showed that 
this gaze-reactive feedback can improve learning with Guru. In 
our work, we mine eye-tracking data to evaluate compliance with 
specific SRL prompts, and examine whether and how compliance 
with such SRL prompts influences learning gains. 


3. METATUTOR 

MetaTutor [1] is a hypermedia-based ITS containing multiple
pages of content about the circulatory system, as well as
mechanisms to help students self-regulate their learning with the
assistance of multiple speaking pedagogical agents (PAs). When
working with MetaTutor, students are given the overall goal of learning
as much as they can about the human circulatory system. The
main interface of MetaTutor (see Fig. 1) includes a table of
contents (Fig. 1A), the text of the current content page (Fig. 1B), a
miniature image allowing the student to display a diagram along
with the text (Fig. 1C), the current goals and subgoals to learn
about (Fig. 1E), a timer indicating how much time remains in the
learning session (Fig. 1F), and an SRL palette (Fig. 1D). This
palette is designed to scaffold students’ self-regulatory processes
by providing buttons they can select to initiate specific SRL
activities (e.g., making a summary, taking a quiz, setting subgoals).
Further SRL scaffolding is provided by three PAs in the form of 
feedback on student performance on these SRL activities (e.g., 
performance on quiz or on the quality of their summaries), as well 
as prompts designed to guide these activities as needed. The PAs 
deliver these prompts based on student behavior (e.g., time spent 
on page, number of pages visited). 


Specifically, Pam the Planner prompts planning processes
primarily at the beginning of the learning session by suggesting that
students add a new subgoal and, if needed, which one to choose
(e.g., path of blood flow, heart components). Mary the Monitor scaffolds
students’ metacognitive monitoring processes by making them
take quizzes on the target material when they appear to be ready
for them. Based on quiz outcomes, Mary prompts students to
evaluate the relevance of the current content and subgoal to their
knowledge, and suggests how to move through the available
material and subgoals accordingly. Sam the Strategizer prompts
students to apply learning strategies consisting of summarizing
the content studied so far or reviewing notes they have taken on
the content¹.


All PAs provide audible assistance through the use of a text-to- 
speech engine (Nuance). The PAs are visually rendered using 
Haptek virtual characters, which generate idle movements when 
the PAs are not speaking (subtle, gradual head and eye move- 
ments), as well as lip movements during speech. 


4. USER STUDY 

The data used for the analysis presented in this paper were
collected via a user study designed to gain a general understanding of
how students learn with MetaTutor [1]. The study included the
collection of a variety of multi-channel trace data (e.g., eye
tracking, log files, physiological sensors). In this paper, we focus on
using interaction and eye-tracking data to track compliance with
the SRL prompts provided by MetaTutor, and study the
relationship between compliance with the prompts and learning gains.


¹ More details about the design of the agents can be found in [1].


Twenty-eight college students participated in the study, which 
consisted of two sessions conducted on separate days. During the 
first session, lasting approximately 30-60 minutes, students were 
administered several questionnaires, including a 30-item pretest to 
assess their knowledge of the circulatory system. During the sec- 
ond session lasting approximately three hours, students first un- 
derwent a calibration phase with the eye tracker (SMI RED 250) 
as well as a training session on MetaTutor. Each student was then 
given 90 minutes to interact with the system. Finally, students 
completed a posttest analogous to the pretest, followed by a series 
of questionnaires about their experience with MetaTutor. 


5. DATA ANALYSIS


5.1 Evaluating Compliance with Prompts 

In our analysis we categorize prompts into two types based on 
how compliance can be evaluated. The first type includes prompts 
for which compliance can be explicitly assessed from students’
subsequent responses (explicit compliance prompts); the second
type includes prompts for which compliance needs to be inferred 
by mining a variety of behaviors (inferred compliance prompts). 


Explicit compliance prompts are those that:


• Require students to answer “yes” or “no” (using a dialogue
panel that becomes active at the bottom of the display). If
students answer “yes”, the only action they can perform in the
MetaTutor interface is the one they agreed upon (e.g., adding a
specific subgoal suggested by the agent, making or revising a
summary, moving to a previously added subgoal or staying on
the current one)².

• Require students to take a specific action within a specific time
frame (i.e., open the diagram while they are on the current page,
and review notes by the end of the learning session).


Table 1 lists the explicit compliance prompts considered in this 
analysis. 


Inferred compliance prompts are those for which the PAs do not 
force students to provide an explicit answer. Specifically, after the 
agent utters one of these prompts, the student simply clicks on 
“continue” in the same dialogue panel, and can either ignore the 
prompted action, or comply at some point. These prompts (listed 
in Table 2) include all prompts related to staying on or moving 
away from the current page, as well as initiating the action of 
adding a new subgoal. 


5.2 Statistical Analysis 


Our analysis aims to investigate if and how compliance with
MetaTutor’s SRL prompts influences learning. The variable we


* For the “stay on current subgoal” prompt, students are not forced
to comply after answering “yes”, but we have listed it in this
category because students are still required to explicitly answer
“yes” or “no” to the PAs as to whether they want to follow the
prompt or not.
Table 1. List of explicit compliance prompts provided in MetaTutor (grouped by type of prompted SRL processes).

Prompt | Description | SRL process type
Suggest subgoal | Recommend possible subgoals to learn about while the student is adding a new subgoal. | Planning processes
Move to next subgoal | Recommend moving on to another subgoal when the student did well on a quiz related to the current subgoal. | Metacognitive monitoring processes
Stay on subgoal | Recommend learning more about the current subgoal when the student did not do well enough on a quiz related to that subgoal. | Metacognitive monitoring processes
Open diagram | Recommend opening the diagram when it is relevant to the current subgoal. | Metacognitive monitoring processes
Summarize | Recommend making a summary of the current page when the student has spent enough time on that page. | Learning strategies
Revise summary | Recommend revising the summary submitted by the student when there are issues with the summary (e.g., the summary is too long or too short). | Learning strategies
Review notes | Recommend reviewing notes taken on the learning content when approaching the end of the session. | Learning strategies


Table 2. List of inferred compliance prompts provided in MetaTutor (grouped by type of prompted SRL processes).

Prompt | Description | SRL process type
Add subgoal | Recommend adding a new subgoal to learn about when a student has no active subgoal. | Planning processes
Move to next page | Recommend moving on to another page when the student did well on a quiz related to the current page. | Metacognitive monitoring processes
Stay on page | Recommend staying on the current page when the student did not do well enough on a quiz related to that page. | Metacognitive monitoring processes


adopted to measure learning in our analysis is proportional learn- 
ing gain, defined as: 


\[
\text{proportional learning gain} = \frac{\text{posttest score ratio} - \text{pretest score ratio}}{1 - \text{pretest score ratio}}
\]

Table 3 reports statistics for pre- and post-test scores, as well as
for the corresponding learning gains.†
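
The proportional learning gain defined above can be sketched in a few lines of Python. This is an illustrative helper rather than the authors' analysis code; the function name and the input checks (ratios in [0, 1], a pretest ratio below 1) are assumptions added for the example.

```python
def proportional_learning_gain(pretest_ratio: float, posttest_ratio: float) -> float:
    """Share of the headroom above the pretest score that was gained at posttest.

    Both arguments are score ratios in [0, 1] (score divided by maximum score).
    """
    if not (0.0 <= pretest_ratio <= 1.0 and 0.0 <= posttest_ratio <= 1.0):
        raise ValueError("score ratios must lie in [0, 1]")
    if pretest_ratio == 1.0:
        # Assumption for this sketch: a perfect pretest leaves no headroom,
        # so the gain is treated as undefined rather than set to 0.
        raise ValueError("gain is undefined when the pretest ratio is 1")
    return (posttest_ratio - pretest_ratio) / (1.0 - pretest_ratio)


# A student moving from 50% to 75% has gained half of the available headroom.
gain = proportional_learning_gain(0.5, 0.75)
```

Dividing by the headroom puts students with different pretest scores on a comparable scale, which is the reason to prefer this measure over a raw score difference.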


Table 3. Descriptive statistics for pretest, posttest, and 
learning gain. 


[Table body not recoverable from the source; rows: pretest, posttest, proportional learning gain.]


We conducted two separate analyses for explicit and inferred 
compliance prompts, described next. 


Explicit compliance prompts. Since compliance is directly
observed in the data for explicit compliance prompts (listed in
Table 1), we computed a compliance rate for each of these
prompts as follows:


\[
\text{compliance rate} = \frac{\text{Number of prompts followed}}{\text{Total number of prompts delivered}}
\]
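
As an illustration, the compliance rate for one prompt type can be computed per student and then averaged across students. The sketch below is not the authors' code: the function names are hypothetical, and skipping students who never received the prompt is an assumption, since the text does not say how such cases were handled.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def compliance_rate(followed: int, delivered: int) -> Optional[float]:
    """Prompts followed divided by prompts delivered, for one student and one prompt type."""
    if delivered == 0:
        return None  # assumption: the rate is undefined if no prompt was delivered
    return followed / delivered


def mean_compliance_rate(per_student: Sequence[Tuple[int, int]]) -> Optional[float]:
    """Average the per-student rates, ignoring students with an undefined rate."""
    rates = [r for r in (compliance_rate(f, d) for f, d in per_student) if r is not None]
    return mean(rates) if rates else None


# Made-up (followed, delivered) counts for three students.
average = mean_compliance_rate([(1, 2), (3, 4), (0, 0)])
```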


† The increase from pretest to post-test is statistically significant,
indicating that MetaTutor is overall effective at fostering learning,
as further discussed in [1].


Table 4 shows the compliance rate averaged across students for 
each of the seven explicit compliance prompts in MetaTutor, and 
the number of prompts delivered. 


Table 4. Descriptive statistics of the number of explicit com- 
pliance prompts delivered, as well as on compliance rate. 


Prompt | Total number of prompts delivered | Compliance rate, Mean (SD)
Open diagram | 77 | 20 (32)
Review notes | 28 | 46 (31)
[Rows for the remaining explicit compliance prompts, including suggest subgoal, are illegible in the source.]


To investigate the impact of compliance with explicit compliance 
prompts on learning, we ran a multiple linear regression model 
with proportional learning gain as the dependent variable, as 
well as the compliance rate for each of the seven explicit compli- 
ance prompts, and the total number of prompts received as the 
factors. For post-hoc analysis we ran pairwise t-test comparisons, 
and p-values were adjusted with the Holm-Bonferroni approach to 
account for multiple comparisons. 
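
The Holm-Bonferroni adjustment mentioned above is a standard step-down procedure and can be sketched as follows. This is a generic implementation, not the authors' script, and the p-values in the usage line are made up for illustration.

```python
from typing import List, Sequence


def holm_bonferroni(p_values: Sequence[float]) -> List[float]:
    """Return Holm-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Visit p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Three hypothetical pairwise comparisons.
adjusted = holm_bonferroni([0.01, 0.04, 0.03])
```

A comparison is then declared significant when its adjusted p-value falls below the chosen level (0.05 in this paper).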


Inferred compliance prompts. As stated above, for inferred
compliance prompts (listed in Table 2), students are not forced to
explicitly accept or ignore the prompt. This means that compli- 
ance with those prompts has to be assessed from student behav- 
iors following the prompts. One approach we considered was to 
make this assessment binary, as we did for explicit compliance 
prompts, by establishing thresholds for relevant behaviors. For 
instance, compliance with the prompt to re-read the current page 
could be assessed to be true if the student stays on the page for a 
fixed number of seconds after receiving this prompt. However, it
is difficult to fix these thresholds in an informed manner, as they
may depend on the student (e.g., on a student’s reading speed,
existing understanding of the page, etc.), and on the object of the 
prompt (e.g., on the length or difficulty of the page to be re-read). 
It is also difficult to decide which specific behaviors should be 
considered for compliance, as several might be relevant (e.g., time 
spent on a page, specific attention patterns on a page). 


Thus, for the subsequent analysis, we avoided committing to 
specific thresholds and behaviors, and we opted instead for per- 
forming regression analyses to try to relate multiple relevant com- 
pliance behaviors to learning. 


We started by building data windows that capture student data 
from the delivery of each inferred compliance prompt in Table 2, 
to the following actions: 


• “Moving to another page” for the move to next page and stay
on page prompts;
• “Adding a new subgoal” for the add new subgoal prompt.


We used these data windows to derive three behavioral measures 
related to compliance: 


• Window length, capturing how long students spent before moving
on to another page or adding a new subgoal;

• Number of fixations‡ made on MetaTutor’s learning content
(text and diagram), as captured by eye tracking. We use this
measure to understand whether students read the page and/or
processed the diagram;

• Number of SRL strategies initiated by the student by pressing
the corresponding buttons in the SRL palette (see Fig. 1 D).


Higher values of these measures (i.e., long windows, high number
of fixations on the page and high number of SRL strategies used) 
are possible indicators that the student is processing the current 
page, e.g., the student is thinking about or reading the content (as 
captured by the length of the data window and number of fixa- 
tions on the page), or using SRL strategies on the current page. 
Thus, we hypothesized that higher values of these measures could 
reveal compliance with stay on page prompts, whereas lower 
values could reveal compliance with prompts instructing students 
to move on. Similarly, because prompts to add a subgoal require
moving on from the learning content to actually add a subgoal, we 
expected a short window, a small number of fixations on the page, 
and a small number of SRL strategies to indicate compliance. 
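
To make the three behavioral measures concrete, the sketch below derives them from a time-stamped event log for a single prompt. The log format, the `Event` type, and the event labels (`"fixation_content"`, `"srl_strategy"`, `"page_change"`) are hypothetical; MetaTutor's actual log schema is not described in this paper.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Event(NamedTuple):
    time: float  # seconds since the start of the session
    kind: str    # hypothetical label, e.g. "fixation_content", "srl_strategy", "page_change"


def window_measures(events: Iterable[Event], prompt_time: float,
                    closing_kind: str) -> Optional[Dict[str, float]]:
    """Measures for the window from a prompt to the first closing action after it."""
    later = sorted(e for e in events if e.time >= prompt_time)
    end = next((e.time for e in later if e.kind == closing_kind), None)
    if end is None:
        return None  # the window never closed, e.g. the session ended first
    in_window = [e for e in later if e.time <= end]
    return {
        "window_length": end - prompt_time,
        "n_fixations": sum(e.kind == "fixation_content" for e in in_window),
        "n_srl_strategies": sum(e.kind == "srl_strategy" for e in in_window),
    }


# Toy log: a prompt delivered at t = 8 s, and the student leaves the page at t = 20 s.
log = [Event(10, "fixation_content"), Event(12, "srl_strategy"),
       Event(15, "fixation_content"), Event(20, "page_change"),
       Event(25, "fixation_content")]
measures = window_measures(log, prompt_time=8.0, closing_kind="page_change")
```

For the add subgoal prompt, the closing action would be the subgoal-adding event instead of a page change.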


It should be noted that we could have generated other eye- 
tracking measures, such as fixation duration on the text or the 
number of transitions from the text to other components of
MetaTutor’s interface. However, because valid eye-tracking data
were collected for only 16 students out of the 28 who participated 
in the study, resulting in a rather small dataset, we focused on the 
most promising behavioral measures that could be related to compliance,
as a proof of concept. Table 5 shows the number of inferred
compliance prompts delivered to those 16 students.


‡ A fixation is defined as gaze maintained at one point on the screen
for at least 80 ms.


Table 5. Number of inferred prompts delivered. 


Prompt | Total number of prompts delivered


Add a subgoal 


Stay on page 
326 


We leveraged the three aforementioned measures of student be- 
havior to investigate if complying with inferred compliance 
prompts influences learning, and if so, how. Specifically, for each 
of the three inferred compliance prompts, we ran a multiple linear 
regression model with proportional learning gain as the depend- 
ent variable, as well as the window length, number of SRL strate- 
gies performed, and number of fixations on the learning content 
as the factors. As done for explicit compliance prompts, we used 
pairwise t-test comparisons for post-hoc analysis, and all p-values 
were adjusted with the Holm-Bonferroni approach. 


6. RESULTS 


We describe below the significant§ effects found in our analysis,
first for explicit compliance prompts, and second for inferred 
compliance prompts. 


6.1 Effects for Explicit Compliance Prompts 
Our statistical analysis uncovered significant main effects of com- 
pliance rate for three explicit compliance prompts: 


• Revise summary (F(1,20) = 6.17, p = .02, ηp² = .15), shown in Fig. 2a.
• Review notes (F(1,20) = 7.43, p = .013, ηp² = .16), shown in Fig. 2b.
• Suggest subgoal (F(1,20) = 11.4, p = .003, ηp² = .27), shown in Fig. 2c.


These three main effects and related pairwise comparisons all 
reveal that students learned more when they complied more with 
these prompts than when complying less. 


These results for revise summary and review notes are consistent 
with previous findings showing these learning strategies can be 
beneficial for learning [17, 22, 24], and extend them by showing 
that prompting these strategies is effective when students comply 
with the prompts. Notably, we found a significant effect for 
prompts to revise summary, but not for prompts to summarize. 
This indicates that solely prompting to summarize is not enough 
to improve learning, and that guiding the students through the 
process of making a good summary is necessary. Results for sug- 
gest subgoal indicate that recommending a particular learning 
subgoal is useful, possibly because it is difficult for students to 
choose good subgoals by themselves. 


These results suggest examining ways to improve compliance
with prompts to revise summary, review notes and suggest sub- 
goal, since our analysis reveals that not complying with them 
hinders learning. For instance, MetaTutor could foster compliance 
with these prompts by explaining how they can help the students, 
or conversely force the students to follow these prompts. 


§ We report statistical significance at the 0.05 level throughout
this paper, and effect sizes as small for ηp² > 0.02, medium for
ηp² > 0.13, and large for ηp² > 0.26.
a. Main effect of compliance rate with “revise summary”.
b. Main effect of compliance rate with “review notes”.
c. Main effect of compliance rate with “suggest subgoal”.
d. Main effect of fixation on page after reception of “add subgoal”.

Figure 2. Main effects found in this analysis, for explicit compliance prompts (charts a, b, c) and inferred compliance prompts
(chart d). Error bars show 95% confidence interval.


We found no significant effects and small effect sizes (see Appen- 
dix A) for the four remaining prompts, namely summarize, stay 
on subgoal or move to next subgoal, and open the diagram. 
These results indicate it is important to study the effectiveness of 
SRL prompts individually, to identify those for which compliance 
does not improve learning. Based on these findings, it is justified 
to further investigate why complying with these prompts is not 
beneficial for learning in MetaTutor, and revise the prompts ac- 
cordingly. For example, it might be due to the nature of the 
prompts, their timing, their frequency, their wording, and so forth. 


6.2 Effects for Inferred Compliance Prompts 
We found a main effect of fixation on learning content for the
“add subgoal” prompts (F(1,3) = 13, p = .03, ηp² = .29), shown in
Fig. 2d. This effect and related pairwise comparisons reveal that
students learned more when they fixated more on the current page
than when they fixated less. Since students were instructed to add a
new subgoal rather than process the current page, this finding 
suggests that complying with this prompt might not be effective 
for learning with MetaTutor, possibly because of the timing of 
this prompt, its frequency or its wording. Although only seven 
students with valid gaze data received this prompt, the effect size 
is large, suggesting it is worth conducting further analysis to ascertain
whether and why complying with this prompt is not beneficial
for learning.


We found no effects and small effect sizes (see Appendix B) for 
the other inferred compliance prompts, namely stay on page and 
move to next page, two prompts related to metacognitive monitoring
processes. We cannot draw final conclusions on the pedagogical
effectiveness of these prompts based on these results, because
the dataset is not large and for this reason we did not include in 
the analysis other features that could indicate compliance (for 
example other eye-tracking measures such as fixation duration on 
text or gaze transitions from the text to other components of 
MetaTutor). However, it should be noted that we also found no 
effect for the explicit compliance prompts that foster metacogni- 
tive monitoring processes (stay on subgoal, move to next subgoal, 
and open the diagram, see previous section). This lack of effect 
for all prompts fostering metacognitive monitoring, even when 
compliance is explicitly assessed, suggests that these prompts are 
not beneficial for learning with MetaTutor. This could be due to 
the way these prompts are currently implemented in MetaTutor 
(e.g., their wording, timing of delivery, or frequency), or to the nature
of the prompts themselves. Our results nonetheless justify running further
analysis to ascertain whether (and why) prompts fostering metacognitive
monitoring are not effective, and revising them as needed.


7. CONCLUSION 


In this research we investigated the relationship between compli- 
ance with prompts designed to support the use of self-regulated 
learning (SRL) processes and learning gains while learning about 
the human circulatory system with MetaTutor. We identified two
approaches to evaluate compliance with MetaTutor’s prompts:


(i) Assess compliance from students’ subsequent response to the 
prompts when students are forced to express compliance (e.g., by 
answering “yes” or “no” to a prompt); 


(ii) Run linear models to examine the influence on learning of a 
variety of student behaviors related to prompt compliance, when 
compliance is not elicited by MetaTutor. The behaviors we mined 
are based on both interface and eye-tracking data (e.g., time spent 
on that page, gaze fixations on the content of the page). 


Our results revealed that student learning gains are influenced by 
compliance with some, but not all SRL prompts provided by 
MetaTutor. Specifically, we found a positive influence on learning 
for prompts that foster learning strategies (revise a summary and 
review notes) as well as prompts that recommend setting a specif- 
ic learning subgoal. Based on these findings, it is worth exploring 
ways to improve compliance with these prompts. In particular, in 
future research we plan to examine whether forcing students to 
comply with these prompts or providing detailed explanations on 
how the prompted SRL strategies can be useful can improve 
learning. 


We found that compliance with MetaTutor’s other prompts
studied in this analysis does not improve learning. This finding
reveals that assessing compliance with SRL prompts individually is
useful to identify prompts that may not be effective at supporting
learning. In particular, we found no effects for any of the prompts related
to metacognitive monitoring processes (e.g., staying on/moving
away from the current page), which calls for examining further why
complying with these prompts does not influence learning with
MetaTutor. For example, it could be due to their timing and fre- 
quency, their wording, their nature, and so forth. 


In this paper we also addressed the challenge of evaluating compliance
with rather open-ended prompts for which there is no
clear definition of compliance. Specifically, we ran a linear regression
analysis to relate relevant compliance behaviors to learning.
Such behaviors were derived from a combination of student interaction
and eye-tracking data after receipt of a prompt (e.g., time
spent and number of gaze fixations on a page can reveal compliance
with a prompt to read that page). Preliminary results show that
such interaction-based and eye-tracking-based measures can help
evaluate compliance. In future research, we plan to investigate
further behavioral measures relevant to assessing compliance, 
such as tracking eye gaze patterns on the different components of 
MetaTutor as well as transitions between those components. 


Lastly, we plan to investigate the possibility of detecting in real 
time compliance with SRL prompts for which we found a positive 
effect on learning, using eye-tracking and interaction data. Such 
real-time detection could inform the design of adaptive prompts to 
foster compliance for those students who might otherwise disre- 
gard these prompts. For instance, adaptive prompts could force 
students to follow them or explain how the prompted SRL pro- 
cesses can improve learning. Evaluating such adaptive prompts 
fostering SRL processes would provide further insights on how 
students comply with and benefit from SRL prompts. 
Appendix A. All statistical results for explicit compliance prompts (discussed in
Section 6.1). Bold indicates a significant effect.

Appendix B. All statistical results for inferred compliance prompts (discussed in
Section 6.2). Bold indicates a significant effect.


[Appendix tables not recoverable from the source; surviving fragments: window length, ηp² = .04; ηp² = .02.]


Proceedings of the 10th International Conference on Educational Data Mining 127 


Assessing Computer Literacy of Adults with Low Literacy 
Skills 


Andrew M. Olney 
Institute for Intelligent Systems 
University of Memphis 
Memphis, TN 38152 


aolney@memphis.edu 


Daphne Greenberg
Department of Educational Psychology, Special
Education, and Communication Disorders
Georgia State University
Atlanta, GA 30302
dgreenberg@gsu.edu


ABSTRACT 


Adaptive learning technologies hold great promise for im- 
proving the reading skills of adults with low literacy, but 
adults with low literacy skills typically have low computer 
literacy skills. In order to determine whether adults with 
low literacy skills would be able to use an intelligent tutoring
system for reading comprehension, we adapted a 44-task
computer literacy assessment and delivered it to 114 adults
with reading skills between 3rd and 8th grade levels. This
paper presents four analyses on these data. First, we report 
the pass/fail data natively exported by the assessment for 
particular computer-based tasks. Second, we undertook a 
GOMS analysis of each computer-based task, to predict the 
task completion time for a skilled user, and found that it 
negatively correlated with proportion correct for each item, 
r(42) = −.40, p = .01. Third, we used the GOMS task decomposition
to develop a Q-matrix of component computer
skills for each task, and using logistic mixed effects models 
on this matrix identified five component skills highly pre- 
dictive of the success or failure of an individual on a com- 
puter task: function keys, typing, using icons, right clicking, 
and mouse dragging. And finally, we assessed the predictive 
value of all component skills using logistic lasso. 


Keywords 
adult literacy, computer literacy, GOMS, Q-matrix, mixed 
model, lasso 


1. INTRODUCTION 


Of adults with the lowest literacy levels, 43% live in poverty, 
and low literacy costs the U.S. economy $225 billion annually
[14]. The need for literacy interventions is matched
by the complexity of delivering interventions to this pop- 
ulation. Low literacy adults have difficulty attending face 
to face programs at literacy centers because of work, child 
care, and transportation [5], and even when these challenges 
are met, two-thirds of literacy centers have long waiting 
lists [14]. Adaptive computer-based interventions for liter- 
acy hold promise to overcome these challenges. Such in- 
terventions can be deployed in homes and local libraries, in 
addition to literacy centers. However, computer-based inter- 
ventions raise another question: can adults with low literacy 
skills use computers well enough to benefit? Several surveys 
suggest that this might be a problem. The demographics 
most affected by low literacy are the same demographics 
least likely to use the Internet (over age 50, making less 
than $30 thousand a year, and with less than a high school 
education [1]). 


Several decades of research have investigated computer lit- 
eracy using self-report measures as well as objective tests, 
i.e. multiple choice, and find that self-report measures tend 
to exaggerate proficiency while objective tests are more re- 
liable (see [3] for a review). For an adult literacy popula- 
tion, however, multiple-choice tests delivered as print create 
additional concerns as to whether the questions themselves 
can be comprehended. Recently a new type of assessment, 
known as the Northstar Digital Literacy Assessment (the 
Northstar), has been created that directly measures ability 
to perform computer tasks [13]. Unlike multiple choice as- 
sessments, the Northstar can simulate a computer desktop, 
use voice prompts to instruct users to perform tasks on that 
desktop, and then record their mouse clicks and keystrokes 
to determine if the task has been completed. Almost all 
of the tasks can be completed without reading by listening 
to the voice prompt instructions. The few tasks that do 
involve reading are word recognition tasks rather than sen- 
tence reading, e.g. a task to log in may require the user 
to copy a name and password to the appropriate boxes and 
so require reading of “Username,” “Password,” and the cor- 
responding fillers. The Northstar has been adopted as the 
computer literacy standard for adult basic education in the 
state of Minnesota, which further supports its appropriateness
for assessing the computer literacy skills of adults with
low literacy skills. 


The present study investigated the computer literacy skills 
of adults with low literacy skills for the purpose of devel- 
oping an intelligent tutoring system for reading comprehen- 
sion for this population [7]. It includes a set of Northstar 
items that were collected to cover a range of potential inter- 
face and interaction components. In the remainder of the 
paper we describe the data collection procedure and four 
analyses performed, including pass/fail frequencies for each 
task, relation of these frequencies to GOMS-predicted exe- 
cution times for skilled users, a logistic mixed-model using a 
Q-matrix decomposition of the tasks into component skills, 
and a logistic lasso model to assess the predictive value of 
component skills. From these analyses we identify specific 
tasks that are problematic for adults with low literacy skills 
as well as component skills that make it more likely adults 
with low literacy skills will succeed or fail at a computer- 
based task. 


2. ANALYSIS 1: PROPORTION CORRECT 


2.1 Participants 

Participants (N = 114) were recruited through adult literacy 
centers in Atlanta, GA and Toronto, ON, from classes where 
the reading level was between 3rd and 8th grade. Reading 
level was determined by the centers using their “business as 
usual” assessments. Demographic surveys were completed 
by 90 participants (79% completion rate). Completed sur- 
veys indicated that participants were slightly more female 
than male (55 vs. 35) and that participant age ranged from 
17 to 69 (M = 42.74, SD = 13.73). 


2.2 Materials 


Forty-four items were selected from four (out of seven) of 
the Northstar modules available at the time of the study: 
Basic Computer Skills (21), WWW (13), Windows (6), and 
Email (4). Task descriptions are given in Table 1. Basic 
Computer Skills covered such topics as turning a computer 
on, identifying components of a computer, files and fold- 
ers, menus, and windows. WWW focused on browser-based 
activities like searching, search results, browser functionali- 
ties, and logging in. Although the Windows module focused 
on Windows overall, the items selected were fairly generic 
to any windowed operating system and mostly pertained to 
desktop applications. Email questions used a webmail inter- 
face (browser-based email client) and queried how one would 
create a new email, send an email, or similar email task. 
Because Northstar modules are integrated assessments, the 
Northstar Project compiled the items we selected into a cus- 
tom assessment for us. 


2.3 Procedure 
Participants first completed informed consent and then the
demographic survey. Both the informed consent and the demographic
survey were read aloud to participants to ensure comprehension.
Participants were then asked to sit in front of a computer
to take the Northstar assessment. The assessment was
delivered in the browser using Adobe Flash. At the start of 
the assessment, a 3-minute orientation video was played ex- 
plaining how to answer questions in the assessment. If the 
participant was confused about what to do, an experimenter
was available to answer questions. Each question consisted
of a voice prompt defining the task, which was also written
at the top of the screen. A replay button was available
to repeat the prompt. Participants could select, click, type, 
drag, etc. on the interface in an attempt to perform the task. 
If the participant did not know how to complete the task, 
they could press an “I Don’t Know” button, at which point 
the system scored their attempt as a failure. Attempts were 
only scored as a success if the participant completed the task 
in the manner requested in the prompt. The completion of 
each task initiated the next task until the assessment was 
complete. 


2.4 Results & Discussion 


The Northstar records success/failure of each participant on 
each task, and these data are reported in detail elsewhere [2]. 
Here we briefly note that the proportion of correct responses
varies widely across tasks, ranging from .19 to .98. Tasks
in which participants performed particularly well (propor- 
tion correct above .80) include identification tasks (e.g. for 
mouse, keyboard, headphone jack, and websites), turning on 
a computer or monitor, and common operations like recy- 
cling a file, using checkboxes, dragging, scrolling, and using 
hyperlinks. Tasks in which participants performed poorly 
(proportion correct below .60) include identification of var- 
ious keys, double- or right-clicking, typing web addresses, 
signing into email, and composing email. 
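
The per-task summary described above amounts to a simple aggregation over pass/fail records. The sketch below illustrates it with made-up outcomes; the task names come from Table 1, but the data, the function names, and the input format are assumptions, since the Northstar export format is not described here.

```python
from typing import Dict, List, Sequence, Tuple


def proportion_correct(results: Dict[str, Sequence[bool]]) -> Dict[str, float]:
    """Map each task to the share of participants who passed it."""
    return {task: sum(outcomes) / len(outcomes)
            for task, outcomes in results.items() if outcomes}


def bucket_tasks(proportions: Dict[str, float], high: float = 0.80,
                 low: float = 0.60) -> Tuple[List[str], List[str]]:
    """Split tasks into those above the high cutoff and those below the low cutoff."""
    well = sorted(t for t, p in proportions.items() if p > high)
    poorly = sorted(t for t, p in proportions.items() if p < low)
    return well, poorly


# Made-up pass/fail records for two tasks (True = pass).
toy = {
    "Turn on computer": [True, True, True, True, True],
    "Right click menu": [True, False, False, False],
}
well, poorly = bucket_tasks(proportion_correct(toy))
```

The default cutoffs mirror the .80 and .60 thresholds used in the text.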


The proportion correct results from the Northstar indicate
that adults with low literacy skills can power on their device
and perform a variety of basic operations. To the extent
that these tasks exactly matched tasks that would be per- 
formed in a computer-based literacy intervention, like an 
intelligent tutoring system, this level of results is quite use- 
ful. However, for some tasks there is not an exact match, 
and the implications of the proportion correct results are 
less clear. For example, difficulties performing tasks using 
Word, Excel, or webmail may reflect problems with those 
specific interfaces that may not transfer to other programs. 
Understanding these more nuanced relationships would re- 
quire a deeper analysis than is afforded by Northstar’s suc- 
cess/failure output. 


3. ANALYSIS 2: GOMS MODELING 


The purpose of this analysis was to explore whether the 
success rate of the Northstar tasks could be modeled us- 
ing GOMS (Goals, Operators, Methods, & Selection rules), 
a well-known computational technique for modeling expert 
user performance on a task [10]. GOMS decomposes a particular
computer task, e.g. saving a file, into goals and subgoals,
operators (perceptual, cognitive, and motor actions in service
of these goals), methods or sequences of operators that achieve
a goal, and selection rules that choose between alternative 
methods. An important assumption of GOMS is that the 
users are expert at the computer task in question. Therefore 
GOMS models of execution time represent the upper bound 
of performance after a user has learned the interface and 
practiced it many times. The expert assumption of GOMS 
is violated in the adult literacy population, making the outcome
of this analysis non-obvious. If the GOMS model predictions
of execution time were related to our adults’ performance,
that would provide evidence that GOMS modeling
Table 1: Northstar Tasks
Recycle file
Checkboxes
Organize folder options
Click stop loading
Select search engines
Google query
Click on the monitor
Click on the keyboard
Click on the system unit
Click on the headphone jack
Start menu, launch program
Google scroll
Click on picture of a mouse
Newline key
Caps key
Shift key
Use hyperlink
Maximize window
Minimize window
Open Excel
Open Word using taskbar
Close Word
Select login and password
Choose secure password
Sign into email
Compose email
Turn up audio slider
Mute audio
Select browser icons
Click on the website
Drag item in browser
Click on address bar
Type the web address
Click homepage button
Click browser back button
Click browser refresh
Click browser forward
Backspace key
Up arrow
Turn on monitor
Turn on computer
Log on to computer
Double click on Documents
Right click menu
Figure 1: A CogTool annotation of a Northstar
task. Annotations appear as semi-transparent orange
boxes over the Northstar interface.


has some validity for this population. 


3.1 Procedure 

The CogTool system was used to perform a GOMS analysis 
[11, 9]. CogTool allows the easy creation of GOMS models 
by annotating an existing user interface, and then recording 
a demonstration of the task against that annotated interface.
Figure 1 shows the CogTool interface for the “Click on
the mouse” task. For example, when the Northstar task re- 
quired clicking on an icon, button, or other interface element 
as in Figure 1, a CogTool button annotation was overlaid on 
the interface, and then in demonstration mode the modeler 
would demonstrate the task by clicking on the annotated 
button. From this demonstration on the annotation, Cog- 
Tool builds a GOMS model that includes the perceptual, 
cognitive, and motor tasks required to perform the task. 
Similar annotations were made for auditory directions, key- 
board input, and other kinds of interface actions. Once a 
task was annotated and demonstrated, a CogTool simulation 
was run on the GOMS model to generate a predicted execution
time of expert performance. Annotations, demonstrations, 
and execution time predictions were performed for all 44 
Northstar items used in Analysis 1.


3.2 Results & Discussion 

GOMS-predicted execution times for Northstar tasks ranged 
from 3.0 to 17.1 seconds (M = 6.88, SD = 4.07). These ex- 
ecution times were significantly negatively correlated with 
proportion correct, r(42) = -.40, p = .01, CI95[-.61, -.10],
indicating that tasks predicted to take an expert longer 
to accomplish were more likely to be answered incorrectly 
by low literacy adults. Tasks that take longer are inher- 
ently more complex and require more operations to com- 
plete. These results suggest that GOMS has some valid- 
ity for modeling the performance of adults with low literacy 
skills even though it was not intended for this purpose. How- 
ever, by themselves these results convey little additional in- 
sight. The GOMS-predicted execution times, generated by 
CogTool, are still at the task level rather than at the level of the
component skills required to achieve each task. This is partly
because the orientation of CogTool is to produce execution 
times and partly because of the expert orientation of GOMS. 
For example, in GOMS the factors involved in clicking a but- 
ton are the perceptual (size, location) and motor operations 
involved, but in Northstar, some “buttons” are tapping spe- 
cific types of knowledge, like identifying hardware, under- 
standing icons, or various keys on a keyboard. The different 
types of knowledge behind the various CogTool annotations 
are not represented or considered in the GOMS analysis it 
provides. 


4. ANALYSIS 3: Q-MATRIX & LOGISTIC 
MIXED MODELS 


We would like to understand how the component skills un- 
derlying Northstar tasks differentially affect the probability 
a low literacy adult will perform the task correctly. In ed- 
ucational data mining, component skills are typically mod- 
eled using a Q-matrix analysis [4]. In its simplest form, 
a Q-matrix analysis constructs a problem-by-skill matrix
such that cell_ij in the matrix represents whether skill_j is
needed to solve problem_i: cell_ij = 1 if skill_j is needed to
solve problem_i, and cell_ij = 0 if skill_j is not needed to solve
problem_i. Analysis 2 provides a useful guide towards the
creation of a Q-matrix for the Northstar tasks, as it has al- 
ready captured each component action required to perform 
Table 2: Component skills coded from GOMS


Component Skill          Probability Correct Given Skill

Checkboxes               .89
Mouse Drag               .86
Hardware Identify        .83
Hardware Function        .78
Complex Scrolling        .74
Browser Functions        .66
Left Click               .64
Use Icons                .61
Double Click             .58
Window Functionality     .56
Program Brands           .55
Desktop Concept          .53
Select Menu              .50
Good Login Info          .50
Login Info               .48
Keyboard Function        .46
Simple Typing            .43
Right Click              .19


each task. What it lacks in some cases, however, is an an- 
notation of the knowledge behind each component action. 
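
To make the Q-matrix idea concrete, the following minimal Python sketch builds a small task-by-skill matrix and computes, for each skill, the proportion of correct responses on tasks that require it, which is the kind of quantity reported in Table 2. The task names, skill assignments, and response counts below are invented for illustration and are not the Northstar data; the function names are likewise hypothetical.

```python
# Hypothetical Q-matrix: one row per task, one column per component skill.
# A 1 means the task requires that skill, a 0 means it does not.
SKILLS = ["Left Click", "Mouse Drag", "Simple Typing"]
Q = {
    "Recycle a file":     [1, 1, 0],
    "Type web address":   [1, 0, 1],
    "Scroll by dragging": [0, 1, 0],
}

# Invented (number correct, number attempted) counts for each task.
RESPONSES = {
    "Recycle a file":     (80, 100),
    "Type web address":   (40, 100),
    "Scroll by dragging": (90, 100),
}


def p_correct_given_skill(q, responses, skills):
    """Pool responses over all tasks that require each skill."""
    result = {}
    for j, skill in enumerate(skills):
        tasks = [task for task, row in q.items() if row[j] == 1]
        correct = sum(responses[task][0] for task in tasks)
        attempts = sum(responses[task][1] for task in tasks)
        result[skill] = correct / attempts if attempts else float("nan")
    return result


def skills_per_task(q):
    """Total number of component skills required by each task."""
    return {task: sum(row) for task, row in q.items()}


probabilities = p_correct_given_skill(Q, RESPONSES, SKILLS)
totals = skills_per_task(Q)
```

In this toy example a skill that appears only in easy tasks receives a high conditional probability, which is why such descriptive values are best read alongside a model that estimates all skills jointly.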


4.1 Procedure 

The first author recoded the GOMS task annotations with 
18 novice-relevant component skills. The coding was done
in one pass, and component skills were defined on the fly. 
Component skills that occurred in only one task were then 
removed as they offer no predictive utility for other tasks. 
The appropriateness of the component skills was evaluated 
by correlating the total number of component skills needed 
in each task with the GOMS execution time and the pro- 
portion correct for the respective task. We used a logistic 
mixed model to predict the correctness of each participant 
on each task as a function of the presence of component 
skills for that task. This analysis addresses the question as 
to whether there is an effect (main effect) of the presence of 
component skills on the likelihood that an adult with low lit- 
eracy skills will be able to perform the task correctly. Using 
a logistic mixed model in this way has strong similarities to 
cognitive psychometric models like Diagnostic Classification 
Models [16] or more specifically a mixed model implementa- 
tion of linear logistic test models [15]. 


In the logistic mixed model, random slopes were initially in- 
cluded but failed to converge. Random intercepts for task 
and participant are theoretically motivated, and backward 
selection of these effects using Akaike information criterion 
(AIC) achieved a minimum when these effects were included, 
indicating that these intercepts should remain in the model. 
These random intercepts can be considered as per-task dif- 
ficulty not captured by component skills and per-subject 
ability, respectively. The initial model that included Left 
Click was rank deficient, so Left Click, which appears in 
most tasks, was removed from the final model. Addition- 
ally, the total number of component skills in each task (i.e. 
column sums of the Q-matrix) was initially considered as 
a predictor of correctness, but was excluded based on ex- 
tremely high collinearity, having a variance inflation factor 
of over 40. 
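
One way to write down the model described in this section is sketched below. The notation is ours and is not taken from the paper: Y_pt is 1 if participant p performed task t correctly, Q_tk is 1 if task t requires component skill k (with Left Click excluded), u_p is the random intercept for participant p, and v_t is the random intercept for task t.

```latex
\operatorname{logit}\,\Pr(Y_{pt} = 1)
  = \beta_0 + \sum_{k} \beta_k \, Q_{tk} + u_p + v_t,
\qquad
u_p \sim \mathcal{N}(0, \sigma_u^2), \quad
v_t \sim \mathcal{N}(0, \sigma_v^2)
```

Under this reading, each \beta_k is the change in log odds of a correct response associated with the presence of skill k, holding the other skills, the participant, and the task-specific residual difficulty fixed.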
4.2 Results & Discussion


The component skills and the conditional probability that 
a task will be correctly performed if the component skill is 
present are shown in Table 2. Total component skills per
task was marginally positively correlated with GOMS execution
time, r(42) = .27, p = .07, CI95[-.02, .53], suggesting
that tasks with more component skills take longer to perform.
Total component skills per task was significantly negatively
correlated with proportion correct, r(42) = -.35, p = .02,
CI95[-.59, -.06], indicating that tasks with more component
skills are more difficult to perform correctly. The
correlation between predicted execution time and propor- 
tion correct was not significantly different from the correla- 
tion between total component skills and proportion correct, 
t(82) = .18, p = .86, indicating that the Q-matrix decompo- 
sition of component skills is comparable to the GOMS exe- 
cution time in terms of its relationship to proportion correct- 
ness. Altogether these correlation results provide additional 
evidence that the Q-matrix decomposition is appropriate. 


The logistic mixed model had a marginal R² of .18 (fixed
effects only) and a conditional R² of .47 (including random
effects) [12]. We found a positive main effect of Mouse
Drag, β = 2.06, SE = .90, p = .02, such that tasks with
a Mouse Drag component were 7.87 times as likely to be
answered correctly, CI95[1.36, 45.50], and a marginal main
effect of Hardware Identify, β = .89, SE = .53, p = .10,
such that tasks with a Hardware Identify component were
2.44 times as likely to be answered correctly, CI95[.86, 6.94].
We found negative main effects for Keyboard Function, β =
-1.31, SE = .51, p = .01, Use Icon, β = -1.35, SE =
.55, p = .01, Simple Typing, β = -1.91, SE = .64, p = .003,
and Right Click, β = -3.20, SE = 1.34, p < .02, such that
tasks with a Keyboard Function component were .27 times
as likely to be answered correctly, CI95[.10, .73], tasks with
a Use Icon component were .26 times as likely to be answered
correctly, CI95[.09, .75], tasks with a Simple Typing
component were .15 times as likely to be answered correctly,
CI95[.04, .52], and tasks with a Right Click component were
.04 times as likely to be answered correctly, CI95[.00, .56].
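
The "times as likely" figures above are odds ratios obtained by exponentiating the log-odds coefficients. As a hedged illustration, the sketch below converts a coefficient and its standard error into an odds ratio with a 95% Wald interval. We assume a Wald interval (β ± 1.96 SE); the paper does not state how its intervals were computed, and results from the rounded published values will differ slightly from the reported ones.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for a log-odds coefficient.

    Assumes a Wald interval on the log-odds scale, exponentiated afterwards.
    """
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )


# Mouse Drag coefficient as reported in the text: beta = 2.06, SE = .90.
mouse_drag_or, mouse_drag_low, mouse_drag_high = odds_ratio_with_ci(2.06, 0.90)
```

For Mouse Drag this gives an odds ratio close to 7.85 with an interval of roughly 1.3 to 46, in line with the reported 7.87 and CI95[1.36, 45.50] once rounding of the published β and SE is taken into account.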


We found that Mouse Drag was extremely predictive of suc- 
cess. The reason is unclear, but we hypothesize that the 
frequency of mouse dragging in many computer tasks may 
have afforded participants the opportunity to become expert 
in this skill. Mouse dragging has some similarity to swiping 
on a smartphone or tablet interface, so it may be that ex- 
pertise with other devices has transferred into the Northstar 
tasks. Amongst the components that predict failure, per- 
haps the most intuitive are Keyboard Function and Simple 
Typing. Typing is a complex skill that takes practice to mas- 
ter. Function keys are difficult in that they don’t themselves 
produce a character, but either operate on a character on the 
screen (Delete) or work in combination with another key to 
modify it (Shift). The negative effects associated with Use 
Icon and Right Click are somewhat surprising. Icons come 
in many different variations, and so it is possible that the 
negative Use Icon effect is attributable to a lack of knowl- 
edge of specific icons or perhaps to the conventions of icons 
generally. Right Click is possibly rare and usually brings up 
a context menu with commands that are often available else- 
where, making it more relevant for power users but perhaps 
less so to novice users. 
Figure 2: The coefficient path for the lasso model. As the L1 sparsity threshold increases along the x-axis,
more coefficients are non-zero.


5. ANALYSIS 4: Q-MATRIX LASSO 


Analysis 3 provides a more traditional analysis of signifi- 
cant predictors in our study, but must be interpreted with 
caution with respect to generalizing to new data. It may 
be that insignificant predictors in Analysis 3 nevertheless 
have predictive value on new data. The problems of rely- 
ing on p-values or criteria like AIC to select variables are 
well known [8]. To explore the predictive potential of the 
Q-matrix component skills, we created a lasso model (least 
absolute shrinkage and selection operator [18]), a form of 
regression that promotes sparsity (i.e. zero coefficients) and 
predictive accuracy simultaneously. While not necessarily 
the best predictive model (cf. gradient boosting [6]), lasso 
has the advantage of being simple to interpret, and thus our 
results can guide what variables to use in future models. 


5.1 Procedure 

A logistic regression base model without random effects was 
initialized with 17 component skills (Left Click excluded) 
and submitted to lasso. Because lasso has a free parameter, 
λ, that controls sparsity of the regression, a lasso analysis
varies the level of λ and generates regression coefficient
estimates at each level. This sequence of regression coefficients
is known as the regularization path. The value of λ that
minimized prediction error was estimated using both cross 
validation and AIC. 
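
The procedure above can be sketched in plain Python as follows. This is an illustrative implementation under several assumptions that are ours, not the paper's: it uses the penalty form of the lasso (larger λ means more shrinkage, the reverse of an L1-norm threshold axis), fits by proximal gradient descent, counts non-zero coefficients as the degrees of freedom for AIC, and runs on invented data with three hypothetical skills.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, amount):
    """Shrink value toward zero; values within +/- amount become exactly 0."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def fit_l1_logistic(X, y, lam, step=0.5, iters=1000):
    """Proximal-gradient fit of mean log-loss + lam * sum(|coef|).

    The intercept is left unpenalized.
    """
    n, d = len(X), len(X[0])
    intercept, coef = 0.0, [0.0] * d
    for _ in range(iters):
        resid = [sigmoid(intercept + sum(c * x for c, x in zip(coef, row))) - t
                 for row, t in zip(X, y)]
        intercept -= step * sum(resid) / n
        for j in range(d):
            grad_j = sum(r * row[j] for r, row in zip(resid, X)) / n
            coef[j] = soft_threshold(coef[j] - step * grad_j, step * lam)
    return intercept, coef


def aic(X, y, intercept, coef):
    """AIC using the non-zero coefficients (plus intercept) as parameter count."""
    loglik = 0.0
    for row, t in zip(X, y):
        p = sigmoid(intercept + sum(c * x for c, x in zip(coef, row)))
        loglik += math.log(p) if t == 1 else math.log(1.0 - p)
    k = 1 + sum(1 for c in coef if c != 0.0)
    return 2 * k - 2 * loglik


def make_rows(features, n_correct, n_total):
    """Build (features, outcome) pairs for one hypothetical task."""
    return [(features, 1)] * n_correct + [(features, 0)] * (n_total - n_correct)


# Invented data: columns are [Mouse Drag, Simple Typing, Filler Skill].
data = (make_rows([1, 0, 0], 18, 20) + make_rows([0, 1, 0], 4, 20)
        + make_rows([0, 0, 1], 10, 20) + make_rows([0, 0, 0], 10, 20))
X = [row for row, _ in data]
y = [label for _, label in data]

# Regularization path: coefficients and AIC at each penalty level.
path = {}
for lam in (0.3, 0.05, 0.01):
    intercept, coef = fit_l1_logistic(X, y, lam)
    path[lam] = (coef, aic(X, y, intercept, coef))
```

At the largest penalty every coefficient is held at zero, and coefficients enter the model as the penalty is relaxed; the penalty level with the lowest AIC would be the one retained. In practice a dedicated lasso library would replace the hand-written fitting loop.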


5.2 Results & Discussion 


The coefficient (regularization) path for the lasso model is 
shown in Figure 2 and the corresponding AIC curve is shown 
Figure 3: The AIC curve for the lasso model. Lower
values of AIC indicate better model fit. 


in Figure 3. In Figure 2, the center line represents coef- 
ficients having zero values. As the L1 sparsity threshold 
(|beta|) increases, more coefficients become non-zero. For
selecting the optimal λ that minimizes overall prediction error,
ten-fold cross validation and AIC yielded congruent results.
AIC results are depicted in the curve in Figure 3, which
shows that AIC improves as |beta| increases, coming to
a minimum at |beta| = 13.40. Accordingly, most coefficients 
for the optimal lasso model are non-zero. 
Table 3: Lasso component skill coefficients


Component Skill          β        exp(β)

Mouse Drag               1.80     6.02
Checkboxes               L227     3.900
Login Information        .88      2.41
Hardware Identify        .60      1.82
Hardware Function        .50      1.65
Desktop Concept          .36      1.43
Browser Functions        .24      1.27
Double Click             .11      1.12
Complex Scrolling        .03      1.03
Program Brands           .00      1.00
Select Menu              -.21     .81
Window Functionality     -.44     .64
Keyboard Function        -.86     .42
Use Icons                -1.01    .36
Good Login Info          -1.28    .28
Simple Typing            -1.47    .23
Right Click              -2.39    .09


Table 3 gives the β coefficients (log odds) for the AIC-optimal
model as well as the odds ratio exp(β) for each coefficient.
The coefficients converted to odds ratios have the
same interpretation as in the logistic mixed model, e.g. tasks 
with a Mouse Drag component are 6.02 times as likely to be 
answered correctly as those without. Although the logistic 
lasso model does not include random intercepts correspond- 
ing to task difficulty and subject ability, the magnitudes of 
coefficients in the logistic lasso are highly comparable to the 
logistic mixed model. However, the coefficients in the logistic
lasso are, in general, weaker than in
the logistic mixed model, suggesting that the logistic mixed 
model may be slightly over-fitted. For example, according to 
the logistic mixed model, Mouse Drag tasks are 7.87 times 
as likely to be answered correctly, but according to the lo- 
gistic lasso model, Mouse Drag tasks are only 6.02 times as 
likely to be answered correctly; similarly Right Click con- 
taining tasks in the mixed model are .04 times as likely to 
be answered correctly compared to .09 times as likely in the 
logistic lasso. These results suggest that while the logistic 
mixed model might be more appropriate for assessment pur- 
poses, as it additionally estimates task difficulty and subject 
ability, the logistic lasso model might be more appropriate 
for predicting the effects of component skills on success rates 
for new tasks. 


6. GENERAL DISCUSSION 


Together, our results suggest that not only are there spe- 
cific Northstar tasks that are informative with regard to 
building an adaptive computer-based intervention for adults 
with low literacy skills but also that these tasks can them- 
selves be decomposed into component skills that can be 
further used for this purpose. The main effects of Analy- 
sis 3 and coefficient rankings of Analysis 4 are consistent 
and complementary with the proportion correct results in
Analysis 1. The marginal main effect for Hardware Iden- 
tify explains the high proportion correctness for identifica- 
tion tasks for mouse, keyboard, and headphone jack, and the 
main effect for Mouse Drag explains the high proportion cor- 
rectness for recycling a file (dragging to the Recycle Bin), 
dragging, and scrolling (by dragging a scroll bar). These 
correctness-enhancing main effects are also reflected in odds
ratios greater than one in Analysis 4. Similarly the main ef- 
fects for Keyboard Function and Simple Typing explain the 
low proportion correctness for identifying various keys, typ- 
ing web addresses, signing into email, and composing email, 
and these main effects are likewise reflected in odds ratios 
less than one in Analysis 4. In these cases we infer that 
the problem is not specific to the interface in question, e.g. 
email, but rather that there is a deficiency in a component 
skill needed for the task taking place in the context of that 
interface. 


The implications for building adaptive computer-based in- 
terventions for adults with low literacy skills are clear. First, 
it is important to keep typing to a minimum, either by hav- 
ing users select response options or by using speech recog- 
nition. Second, right clicking should be eliminated or at 
least made optional. Third, icons should be close to icon 
archetypes. And finally, mouse dragging is a good skill 
around which to build user interaction. Interestingly, all 
of these implications seem to point to tablet and smart- 
phone platforms, which have a minimum of typing (and 
built in speech interfaces), no right clicking, minimal icons 
in-app, and plenty of swiping/dragging. Moreover, smart- 
phone ownership has been rapidly increasing — now 64% of 
households earning below $30 thousand own a smartphone 
[17]. It may be the case that deploying interventions on 
smartphones and tablets better makes use of both the com- 
puter literacy strengths and the material resources of low 
literacy adults. 
ABSTRACT


Research in Educational Data Mining could benefit from greater 
efforts to ensure that models yield reliable, valid, and interpretable 
parameter estimates. These efforts have especially been lacking 
for individualized student-parameter models. We collected two 
datasets from a sizable student population with excellent “depth” 
— that is, many observations for each skill for each student. We fit 
two models, the Individualized-slope Additive Factors Model 
(iAFM) and Individualized Bayesian Knowledge Tracing (iBKT),
both of which individualize for student ability and student 
learning rate. Estimates of student ability were reliable and valid: 
they were consistent across both models and across both datasets, 
and they significantly predicted out-of-tutor pretest data. In one of 
the datasets, estimates of student learning rate were reliable and 
valid: consistent across models and significantly predictive of 
pretest-posttest gains. This is the first demonstration that 
statistical models of data resulting from students’ use of learning 
technology can produce reliable and valid estimates of individual 
student learning rates. Further, we sought to interpret and 
understand what differentiates a student with a high estimated 
learning rate from a student with a low one. We found that 
learning rate is significantly related to estimates of student ability 
(prior knowledge) and self-reported measures of diligence. 
Finally, we suggest a variety of possible applications of models 
with reliable estimates of individualized student parameters, 
including a more novel, straightforward way of identifying wheel 
spinning. 


Keywords 


Explanatory models, model interpretability, individualized 
parameters, 3, Additive Factors Model, individualized Bayesian 
Knowledge Tracing 


1. INTRODUCTION 


In Educational Data Mining, statistical models are typically 
evaluated based on fit to overall data and/or predictive accuracy 
on test data. While this is an important initial step in evaluating 
the contributions of advancements in statistical and cognitive 
modeling, research in the field could benefit from greater efforts 
to ensure that models are reliable and valid. More reliable and 
valid models offer more explanatory power, contributing to the 
advancement of learning science. They also inspire greater 
confidence that deploying model advancements in future tutoring 
systems will genuinely result in the hypothesized improvements to
learning. 


Some recent work has been done towards interpreting, validating, 
and acting upon cognitive/skill modeling improvements [7, 8, 10, 
11, 17]. Educational data mining efforts oriented around 
personalizing student constructs [3, 12, 13, 14, 18], however, have 
remained focused on improving predictive accuracy and/or 
demonstrating hypothetical time savings. Little has been done to 
validate or understand the estimates that models with
individualized or clustered student parameters produce. 
Anecdotally, efforts to do so have shown that these individualized 
student parameter estimates, or discovered student clusters, are 
often difficult to interpret. 


It is especially critical to examine the reliability and validity of 
parameter estimates for modeling advancements that dramatically 
increase the parameter count, as is generally true for 
individualized student-parameter models. More parameters create 
greater degrees of freedom and increase the likelihood that the 
model may be underdetermined by the data. 


We focus on the question: To what degree can we trust a model’s 
parameter estimates to correctly represent the constructs they are 
supposed to? 


Key to expecting reliable, valid estimates of student-level 
constructs is not just big data in the “long” sense, but big data in 
the “deep” sense. Oftentimes, the datasets used in secondary 
analyses in EDM are large in terms of total number of students (or 
total observations) but highly sparse in terms of observations per 
skill, per student. These features make it difficult to get reliable 
measurements of constructs at the individual student level, 
particularly constructs related to learning over time. 


Here, we collected two datasets from a sizable student population 
(196 students) with excellent “depth”, that is, many observations
for each skill for each student. We then fit two models that 
individualize for student ability and student learning rate (the 
Individualized-slope Additive Factors Model [9] and 
Individualized Bayesian Knowledge Tracing [18]). We assess the 
models’ fit to data and predictive accuracy. We also move beyond 
these metrics to examine the reliability of the models’ estimates of 
student ability and student learning rate. Additionally, we 
externally validate the parameter estimates against out-of-tutor 
assessment data. 


We further interpret and understand the constructs by visualizing 
representative student learning trajectories, examining the 
relationship between estimated student ability and student
learning rate, and the relationship between those constructs and 
self-reported data on motivational attributes. Finally, we propose 
some useful applications of reliable and valid individualized 
student-parameter models, including a new way to detect wheel 
spinning. 


2. PRIOR WORK 


Prior work on individualizing student parameters has focused on 
variants of Bayesian Knowledge Tracing (BKT) [3]. This work 
includes modeling the parameters separately for each individual 
student instead of separately for each skill [3], individualizing the 
P(Init) (“initial knowledge”) parameter for each student [13], and
individualizing both P(Init) and P(Learn) (“learning rate”) in the
base BKT model [18]. These models have generally focused on
assessing predictive accuracy improvements relative to their 
respective non-individualized baseline models. 


There have also been some “time savings” analyses [12, 18] that 
evaluate the hypothetical real world impact that individualizing 
statistical model fits could have. These analyses report the effect 
of fitting individualized BKT models, compared to traditional 
BKT, on the hypothetical number of under- and over-practice
attempts that would be predicted for each student. Results 
generally have indicated that many more practice opportunities 
are needed for models to infer the same level of knowledge when 
using whole-population parameters rather than individual student 
parameters. These analyses show that individualized models differ 
in their hypothetical decision points if they were to be applied to 
drive mastery-based learning, but they do not in and of themselves 
interpret the individualized parameter estimates, nor do they 
assess the reliability and validity of such estimates. 


In a previous effort to better understand individualized student 
learning rate parameters [9], we examined predictive accuracy and 
parameter reliability in an extension of the Additive Factors 
Model [2] applied to existing educational datasets. We did not 
find evidence that individualizing student rate parameters
consistently improved predictive accuracy, nor
could we validate the parameter estimates on out-of-tutor 
assessment data. However, the datasets we analyzed either 
contained a small number of students or were largely sparse in 
observations for student-skill pairs, with the exception of two 
datasets. These two datasets happened to be the ones on which the 
Individualized-slope Additive Factors Model did achieve higher 
predictive accuracy. Thus, we wondered if the sparsity of the 
datasets was the primary limitation, rather than the modeling
advancement itself. This idea is corroborated by the fact that 
pooling students into “groups” rather than generating 
individualized estimates worked well on those datasets [9]. 


For the present modeling work, we collected our own data in 
order to ensure the data features that we believe are necessary for 
reliable, valid, and potentially meaningful estimates of constructs 
at the individual student level. 


3. METHODS 


It is common in EDM to do secondary analyses across multiple 
datasets. However, it can be difficult to find datasets that (1) 
contain a sizable number of students, (2) contain many 
observations for each skill for each student (i.e., are not sparse),
(3) contain students spanning a range of abilities in the domain 
covered by the tutor, and (4) contain out-of-tutor
assessment data that are well-mapped to the content in the tutor.


For the present work, we wanted to use as close to an “ideal” 
dataset as possible for estimating student parameters. We 
collected our own dataset with a sizable number of students (196) and
many observations (5-50, depending on the skill) for each skill for
each student. In addition, we ensured that a wide range of student 
ability levels was represented in our data to allow for the 
possibility that models could capture this variability. 


3.1 Data Collection 


196 students, spanning 10 classes taught by three different 
teachers, enrolled in high school geometry participated in two 
studies conducted about a month apart. A range of student 
abilities were included in the study. Two of the 10 classes were 
“Honors” and three of the 10 classes were “Inclusion”. Honors 
classrooms are intended for students who have strong theoretical 
interests and abilities in mathematics. Inclusion classrooms are
“general education” classrooms designed to provide the
opportunity for individuals with disabilities and special needs to 
learn alongside their non-disabled peers. 


Students spent five consecutive days participating in each study 
during their regular geometry class periods. On the first and last 
days, they took a computerized pretest and posttest, respectively. 
During the middle three days, they worked within an intelligent 
tutoring system [19] designed to give them practice on their 
current chapter’s content. This procedure applied to both studies, 
one of which covered the students’ Chapter 3 content (Parallel 
Lines Cut by a Transversal, Angles & Parallel Lines, Finding 
Slopes of Lines, Slope-Intercept Form, Point-Slope Form) and the 
other of which covered the students’ Chapter 4 content 
(Classifying Triangles, Finding Measures of Triangle Sides & 
Angles, Triangle Congruence Properties). Figure 1 shows an 
example problem interface from the intelligent tutoring system, 
which was designed using Cognitive Tutor Authoring Tools [1]. 



Figure 1. Example problem interface from the intelligent 
tutoring system used for data collection. 


We also collected self-report survey data on motivational factors 
falling along three dimensions. These were Competitiveness (e.g., 
“In this unit, I am striving to do well compared to other students” 
and “In this unit, I am striving to avoid performing worse than 
others”), Effort (e.g., “I am striving to understand the content of
this unit as thoroughly as possible” and “I work hard to do well in
this class even if I don't like what we are doing”), and Diligence
(e.g., “when class work is difficult, I give up or only study the
easy parts” [inverted scale] and “I am diligent”). Self-report
measures were given on a Likert scale from 1 to 7.


A key reason we collected two datasets, covering two distinct 
chapters of the curriculum, is that we were interested in 
investigating the consistency of student-level parameter estimates
across different content, time, and contexts. We discuss this
further, along with preliminary results, in Section 4.4.1. 


3.2 Statistical Models 


3.2.1 The Individualized-slope Additive Factors Model (iAFM)

The Additive Factors Model (AFM) [2] is a logistic regression 
model that extends item response theory by incorporating a 
growth or learning term. 


\[ \ln\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \theta_i + \sum_{k \in KCs} Q_{jk}\left(\beta_k + \gamma_k T_{ik}\right) \qquad (1) \]




This statistical model (Equation 1) gives the probability p_ij that a
student i will get a problem step j correct based on the student’s
baseline ability (θ_i), the baseline easiness (β_k) of the required
knowledge components on that problem step (Q_jk), and the
improvement (γ_k) in each required knowledge component (KC)
with each additional practice opportunity. This KC slope, or
“learning rate,” parameter is multiplied by the number of practice
opportunities (T_ik) the student already had on it. Knowledge
components (KCs) are the underlying facts, skills, and concepts
required to solve problems [6].


Individualized-slope AFM (iAFM) builds upon this baseline
model by adding a per-student learning rate parameter (δ_i). This
parameter represents the improvement by student i with every
additional practice opportunity with the KCs required on problem
step j.

\[ \ln\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \theta_i + \sum_{k \in KCs} Q_{jk}\left(\beta_k + \gamma_k T_{ik} + \delta_i T_{ik}\right) \qquad (2) \]

The KC and student learning rate parameters are both multiplied
by the number of opportunities (T_ik) the student already had to
practice that KC.
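
As a minimal sketch of how the two models turn parameters into a prediction, the Python below computes the log odds for one student on one problem step and converts it to a probability. All parameter values and the KC name are invented for illustration; setting the per-student learning rate to zero reduces iAFM to AFM.

```python
import math


def afm_logit(theta_i, kcs, beta, gamma, practice, delta_i=0.0):
    """Log odds that a student gets a problem step correct.

    theta_i:   baseline ability of the student
    kcs:       KCs required by the step (those with a 1 in the Q-matrix row)
    beta[k]:   baseline easiness of KC k
    gamma[k]:  learning rate of KC k
    practice:  prior practice opportunities of this student, per KC
    delta_i:   per-student learning rate (0.0 gives plain AFM)
    """
    return theta_i + sum(
        beta[k] + (gamma[k] + delta_i) * practice[k] for k in kcs
    )


def probability(logit):
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical parameter values for a single-KC step.
beta = {"find_angle": -0.5}
gamma = {"find_angle": 0.1}
practice = {"find_angle": 3}

p_afm = probability(afm_logit(0.2, ["find_angle"], beta, gamma, practice))
p_iafm = probability(
    afm_logit(0.2, ["find_angle"], beta, gamma, practice, delta_i=0.1)
)
```

With three prior opportunities, a positive per-student learning rate raises the predicted probability relative to AFM, and the gap widens as practice accumulates.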


3.2.2 Individualized Bayesian Knowledge Tracing 
(iBKT) 

Bayesian Knowledge Tracing (BKT [3]) is an algorithm that 
models student knowledge as a latent variable using a Hidden 
Markov Model. The goal of BKT is to infer, for each skill, 
whether a student has mastered it or not based on his/her sequence 
of performance on items requiring that skill. It assumes a two-state
learning model whereby each skill is either known or
unknown. There are four parameters that are estimated in a BKT 
model: the initial probability of knowing a skill a priori — p(Init), 
the probability of a skill transitioning from not known to known 
state after an opportunity to practice it — p(Learn), the probability 
of slipping when applying a known skill — p(Slip), and the 
probability of correctly guessing without knowing the required 
skill — p(Guess). Fitting BKT produces estimates for each of these 
four parameters for every skill in a given dataset. BKT models are 
usually fit using the expectation maximization method (EM), 
Conjugate Gradient Search, or discretized brute-force search. 


Individualized Bayesian Knowledge Tracing (iBKT [18]) builds
upon this baseline BKT model by individualizing the estimate of 
the probability of initially knowing a skill, p(Init), and the 
transition probability, p(Learn), for each student. To accomplish 
the student-level individualization of these parameters, each of 
them is split into skill- and student-based components that are 
summed and passed through a logistic transform to yield the final 
parameter estimate. Details on the decomposition of p(Init) and 
p(Learn) into skill- and student-based components are described 
in [18]. 
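
The standard BKT update can be sketched in a few lines of Python. The update rule below is the usual two-step form (condition on the observed response, then apply the learning transition); the `individualized` helper only illustrates the "sum the skill and student components, then apply a logistic transform" idea from the text, with hypothetical component values, and is not the fitting procedure of [18].

```python
import math


def bkt_update(p_known, correct, p_learn, p_slip, p_guess):
    """One BKT step: posterior given the response, then the learning transition."""
    if correct:
        numerator = p_known * (1 - p_slip)
        denominator = numerator + (1 - p_known) * p_guess
    else:
        numerator = p_known * p_slip
        denominator = numerator + (1 - p_known) * (1 - p_guess)
    posterior = numerator / denominator
    return posterior + (1 - posterior) * p_learn


def p_correct(p_known, p_slip, p_guess):
    """Predicted probability of a correct response before observing it."""
    return p_known * (1 - p_slip) + (1 - p_known) * p_guess


def individualized(skill_component, student_component):
    """Combine skill- and student-level components through a logistic transform."""
    return 1.0 / (1.0 + math.exp(-(skill_component + student_component)))


# Hypothetical parameters: p(Init) = .5, p(Learn) = .1, p(Slip) = .1, p(Guess) = .2.
after_correct = bkt_update(0.5, True, 0.1, 0.1, 0.2)
after_incorrect = bkt_update(0.5, False, 0.1, 0.1, 0.2)
student_p_learn = individualized(-2.0, 0.5)
```

A correct response moves the knowledge estimate up and an incorrect one moves it down, while the learning transition always adds some probability of having learned the skill during the opportunity.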


4. RESULTS 
4.1 Model Fit & Predictive Accuracy 


As a first pass evaluation of the two individualized models, we 
assessed them using Akaike Information Criterion (AIC) and 
Bayesian Information Criterion (BIC), which are standard metrics 
for model comparison, and 10 independent runs of split-halves 
cross validation (CV). Although 10-fold cross validation has been 
popular in the field, [4] showed that it has a high type-I error due 
to high overlap among training sets and recommended at least 5 
replications of 2-fold CV instead. 
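
The evaluation scheme can be sketched as below: each run shuffles the data, splits it in half, trains on each half in turn, and records the test RMSE. This is a generic illustration rather than the authors' pipeline; how the paper assigned observations to halves (for example by student or by item) is not stated here, and the baseline model and data are invented.

```python
import math
import random


def rmse(predicted, observed):
    return math.sqrt(
        sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)
    )


def repeated_split_half_cv(data, fit, predict, runs=10, seed=0):
    """Return one test RMSE per run, averaged over the two halves."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        first, second = shuffled[:half], shuffled[half:]
        fold_scores = []
        for train, test in ((first, second), (second, first)):
            model = fit(train)
            predictions = [predict(model, x) for x, _ in test]
            fold_scores.append(rmse(predictions, [y for _, y in test]))
        scores.append(sum(fold_scores) / 2)
    return scores


# Hypothetical baseline: predict the training proportion correct for every step.
def mean_fit(train):
    return sum(y for _, y in train) / len(train)


def mean_predict(model, x):
    return model


toy_data = [(None, 1)] * 30 + [(None, 0)] * 10
cv_scores = repeated_split_half_cv(toy_data, mean_fit, mean_predict)
mean_rmse = sum(cv_scores) / len(cv_scores)
```

Reporting the mean and standard deviation of `cv_scores` across runs gives the kind of summary shown in Table 1.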


Here, the comparison of interest is each individualized model 
against its non-individualized counterpart. We do not encourage a 


literal comparison between the predictive accuracies of the two 
classes of models due to differences in whether they use incoming 
test data towards their predictions on later test data (BKT/iBKT
do, and AFM/iAFM do not). 


Both iAFM and iBKT outperform their non-individualized
counterparts by all metrics, with the exception of BKT having a
better BIC value than iBKT for the Chapter 4 dataset. This is not
surprising, as BIC is known to over-penalize for added
parameters. We recommend cross validation as the better indicator,
and by that measure iBKT is the better-fitting model in this case.
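
To illustrate why BIC can disagree with AIC when many student-level parameters are added, the sketch below uses the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L. The log-likelihoods, parameter counts, and number of observations are hypothetical values chosen for illustration, not results from this study.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fits: the individualized model gains 150 log-likelihood
# units at the cost of 100 extra (per-student) parameters.
n_obs = 20000
base = {"ll": -9000.0, "k": 40}
indiv = {"ll": -8850.0, "k": 140}

aic_prefers_indiv = aic(indiv["ll"], indiv["k"]) < aic(base["ll"], base["k"])
bic_prefers_indiv = (
    bic(indiv["ll"], indiv["k"], n_obs) < bic(base["ll"], base["k"], n_obs)
)
```

With these made-up numbers AIC prefers the individualized model while BIC prefers the baseline, because each added parameter costs ln(20000), roughly 9.9, under BIC but only 2 under AIC.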


Counter to the majority of findings reported in [9], iAFM
achieved higher predictive accuracy than AFM in both datasets 
here. This further supports the idea that the “depth” of the dataset 
is a critical factor in whether an individualized student-parameter 
model can explain unique variance in the data. 


Table 1. Summary of Model Fit and Predictive Accuracy 
metrics comparing AFM vs. iAFM and BKT vs. iBKT. Cross- 
validation values are mean RMSE values across 10 runs, with 
standard deviations included in parentheses. 


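[Table 1 body: row labels and most columns were lost in extraction.
Surviving column headers: CV Test RMSE (10-Run Average); BIC.
Surviving cells, in extraction order: 0.38440 (0.0039), 0.37868 (0.0044),
0.4222 (0.0005), 0.3777 (0.0006), 0.41037 (0.0048), 0.40789 (0.0050),
0.44091 (0.0014), 0.40725 (0.0018).]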


4.2 Reliability of Student Parameters 


Next, we examined the degree to which we can rely on these
parameters to reasonably estimate the constructs that they should
be estimating. We believe that a strong relationship between the
parameter estimates of two statistical models with entirely
different architectures is a high bar for testing reliability. That is,
if a student genuinely displayed evidence of high overall ability in
a dataset (relative to his/her peers), then both iAFM and iBKT
should estimate that to be the case.


Because of known and observed nonlinear relationships between
logistic regression and Bayesian Knowledge Tracing parameter
estimates, we measured correlation based on Spearman’s
coefficient ($r_s$), which is based on rank order.
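
As a minimal sketch of why a rank-based coefficient suits a monotone but nonlinear relationship, the code below computes Spearman's coefficient as the Pearson correlation of average ranks. The helper names and the toy data (logits and their logistic transforms) are illustrative assumptions, not values from the study.

```python
import math


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            result[order[pos]] = avg_rank
        start = end + 1
    return result


def pearson(xs, ys):
    """Pearson product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman's coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Toy data: a logit-scale estimate and a probability-scale estimate that
# are related monotonically but not linearly.
logits = [-2.0, -1.0, 0.0, 0.5, 1.5, 3.0]
probs = [1 / (1 + math.exp(-x)) for x in logits]
rank_corr = spearman(logits, probs)
linear_corr = pearson(logits, probs)
```

Because the two toy variables are perfectly monotonically related, the rank correlation reaches 1 even though the linear correlation falls short of it.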


We observed strong and statistically significant correlations
between iAFM Student Intercept and iBKT Student p(Init)
parameter estimates (Figure 2, top row). We also observed a
strong and statistically significant correlation between iAFM
Student Slope and iBKT Student p(Learn) parameter estimates for
one of the two datasets (Chapter 4). This correlation was much
milder, though still significant, for the other dataset (Chapter 3).


We hypothesize that this difference between datasets may be due
to the presence of more difficult KCs in Chapter 4. A dataset with
more difficult items should provide more sensitive measures of
individual differences in improvement, since it avoids ceiling
effects. Indeed, this was the case: the mean KC easiness parameter
estimate ($\beta_k$) for Chapter 4 was 0.799 (which translates to a
probability of 0.69), compared to 1.253 for Chapter 3 (which
translates to a probability of 0.78). When students spend many
practice opportunities at ceiling (which was the case in particular
for Chapter 3, based on exploratory analyses of the data), the
individualized models will often assign them a lower “learning
rate” due to an essentially flat learning trajectory.
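
The logit-to-probability conversions quoted above can be checked with a few lines. The function and variable names are illustrative; only the two easiness values come from the text.

```python
import math


def logit_to_prob(x: float) -> float:
    """Logistic transform: map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


chapter4_prob = logit_to_prob(0.799)  # mean KC easiness, Chapter 4
chapter3_prob = logit_to_prob(1.253)  # mean KC easiness, Chapter 3
```

Rounded to two decimals, these reproduce the probabilities of 0.69 and 0.78 reported for the two chapters.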


[Figure 2 panels: Chapter 3 (left) and Chapter 4 (right); x-axis: iAFM Student Intercept Estimate.]

Figure 2. Relationships between iAFM Student Intercept and
iBKT Student p(Init) parameter estimates (top row), and 
between iAFM Student Slope and iBKT Student p(Learn) 
parameter estimates (bottom row), for the two datasets. 


4.3 Validity of Student Parameters 

To assess the validity of student parameter estimates, we related 
them to out-of-tutor assessments of the relevant student 
constructs. In this case, we validated parameter estimates using 
pretest and posttest assessment data collected in the study. 


4.3.1 Estimates of Student Ability 

The Student Intercept ($\theta_i$) parameter of iAFM and the Student
p(Init) parameter of iBKT are designed to estimate baseline
student ability, at least for the knowledge domain represented in
the dataset. To validate the models’ estimates of this construct, we
examined relationships between the model estimates and students’
pretest scores, which are an out-of-tutor assessment of student
initial ability for the skills covered by the tutor.


We report standard Pearson correlation coefficients here, since the 
relationships between pretest scores and the parameter estimates 
did not appear to be particularly nonlinear. 


Figure 3 illustrates a summary of these relationships. Both 
models’ estimates of the student ability construct were strongly 
and significantly correlated with pretest scores. 


In addition, adding an individualized student slope improved the
validity of the model’s estimate of student ability (a parameter
that is modeled in both AFM and iAFM). We compared the
correlations between AFM’s intercept estimates and pretest scores
(Chapter 3: r = 0.62, p < 0.0001; Chapter 4: r = 0.58, p < 0.0001)
to the corresponding correlations for iAFM’s intercept estimates
(Chapter 3: r = 0.74, p < 0.0001; Chapter 4: r = 0.66, p < 0.0001).


This has several interesting implications for educational 
applications. First, it suggests that formative assessment via 
modeling of process data as learning unfolds is a reasonable 
method of assessment. 


It also suggests that detailed assessment data (e.g., from a pretest)
could be used to reasonable effect to improve “on-line” estimates
of individual students’ knowledge of KCs. For example,
combining KC parameter estimates (derived from model-fitting to
prior domain-relevant data) with student intercept priors based on
pretest assessment data would allow a model like AFM to
generate individualized predictions of how much each student
needs to practice to reach mastery.


In addition, these results suggest that individualized BKT models 
could use pretest assessment data to “set” reasonably valid 
student-specific p(Init) values before collecting any within-tutor 
data from those students. 


In considering the degree to which these results may generalize, it 
is important to note that the pretests in the present datasets were 
specifically designed to map closely to the practice problems in 
the intelligent tutor. Pretests contained 1-2 questions for each KC 
that was practiced in the tutor, and the items were similar to those 
encountered within the tutor.


Figure 3. Relationships between out-of-tutor pretest scores
and iAFM/iBKT estimates of student ability based on
within-tutor data.


4.3.2 Estimates of Student Learning Rate 

Given that the only external assessment data collected were a 
pretest and posttest, we sought to validate the construct of student 
learning rate (as estimated by the models) on pretest-posttest 
gains. Students were given roughly the same amount of time to 
engage with the tutors, so those with accelerated learning rates 
might be expected to gain more knowledge in the time available. 


Thus, we examined the degree to which student learning rate
estimates predicted pretest-posttest gains while controlling for
pretest scores. We controlled for pretest scores because they have
been shown to negatively predict learning gains due to assessment
ceiling effects. That is, students who start out performing well on
the pretest have less “room for improvement”.


For the Chapter 3 dataset, iAFM Student Slope ($\delta_i$) estimates did
not significantly predict learning gains. In a linear regression
predicting pretest-posttest gains, pretest scores were a significant
predictor (B=-0.189, p=0.005) and Student Slope estimates were
not (B=0.396, p=0.144). iBKT Student p(Learn) estimates also did
not significantly predict learning gains. In a linear regression
predicting pretest-posttest gains, pretest scores were a significant
predictor (B=-0.226, p=0.005) and p(Learn) estimates were not
(B=0.062, p=0.218).


For the Chapter 4 dataset, iAFM Student Slope ($\delta_i$) estimates
significantly predicted learning gains. In a linear regression
predicting pretest-posttest gains, pretest scores (B=-0.641,
p<0.0001) and Student Slope estimates (B=0.576, p=0.007) were
both significant predictors. iBKT Student p(Learn) estimates also
significantly predicted learning gains. In a linear regression
predicting pretest-posttest gains, pretest scores (B=-0.645,
p<0.0001) and p(Learn) estimates (B=0.133, p=0.004) were both
significant predictors.


For one of the two units (Chapter 4), we observed that student 
learning rate estimates were validated on external assessments of 
learning gain. Interestingly, this is the same unit for which we 
observed a strong cross-model reliability in student learning rate 
estimates. Thus, we have converging evidence that student
learning rate estimates for the Chapter 4 dataset are both reliable
and valid.


[Figure 4 panels: iAFM Student Intercept Estimate and iBKT Student p(Init) Estimate (top row); iAFM Student Slope Estimate and iBKT Student p(Learn) Estimate (bottom row); x-axes: Chapter 3. Legible annotations: r = 0.540, p < 0.0001 (top row); r = 0.34, p < 0.0001 (bottom row).]

Figure 4. Relationships between student parameter estimates 
across the two datasets (same student population). 


4.4 Towards Understanding & Using Student 
Parameter Estimates 


4.4.1 Consistency of individual student constructs across datasets

A core motivating question for collecting two datasets on the
same group of students was: How consistent are iAFM and iBKT
model estimates of the student ability and student learning rate
constructs across units?


Figure 4 summarizes this relationship. Estimates of student ability 
are fairly consistent, especially as estimated by iAFM. It seems
sensible to interpret this as suggesting that overall student ability 
on Chapter 3 content is strongly related to overall student ability 
on Chapter 4 content, as we have shown estimates of student 
ability to be both reliable and valid. 


Estimates of student learning rate are less consistent. This may
be because the Chapter 3 estimates of student learning rate were
neither very reliable nor very valid. Alternatively, the differences
in student learning rate estimates across the two chapters may
arise because students genuinely learn different material at
different rates. Unfortunately,
we cannot resolve this question with the present data. We are 
currently collecting more datasets from this same group of 
students. If we obtain more reliable and valid student learning rate 
estimates in future data from this group of students, we can more 
confidently address this question in future research. 


4.4.2 Understanding student learning rate estimates 
Given that iAFM and iBKT’s parameter estimates for the Chapter 4
dataset were reasonably reliable and valid, we sought to dig deeper
into the explanatory power of these estimates. To this end, we conducted
exploratory analyses on the Chapter 4 data to (1) visualize the 
learning trajectories of students with the highest vs. lowest 
estimated learning rates, (2) understand the relationships between 
estimated learning rates and prior-knowledge and motivational 
factors, and (3) understand the degree of variability in estimated 
learning rate across students.

[Figure 5 caption, partially recovered: "... and diligence. Grouped based on iAFM (Left) or iBKT (Right) ... the means."]

Figure 5 (top row) shows the aggregate learning trajectories for
students split based either on their iAFM Student Slope estimates
(top left) or their iBKT Student p(Learn) estimates (top right). The
top 25% of student parameter estimates are plotted in blue, the
middle 50% (between the 1st and 3rd quartiles) are plotted in red, and
the lower 25% are plotted in black. Dotted lines represent each
respective model’s predicted learning trajectories.


One striking pattern, especially in the iAFM learning trajectories
(top left), is the apparent relationship between average success on
initial practice opportunities (i.e., prior knowledge) and estimated
learning rate through the remaining opportunities. This
observation is corroborated by a strong and significant correlation
between iAFM Student Intercepts and iAFM Student Slopes
(r=0.78, p<0.0001). One might interpret this to suggest that
students who enter into the tutor with greater prior knowledge will
be poised to gain more from the tutor (i.e., “the rich get richer”).
Alternatively, students may have higher overall knowledge
because they are fast learners. There may also be individual
trait-based variables that positively drive both learning rate and
overall achievement.


To explore the relationships between measures of traits relevant to 
learning, we analyzed self-report survey data grouped by three 
factors (as described in Section 3.1): Competitiveness, Effort, and 
Diligence. The relationship between these measures and the high, 
medium, and low learning rate estimates from iAFM and iBKT
are shown in Figure 5 (bottom row). There appears to be a 
relationship between the means of each self-report measure and 
the general range that the learning rate estimate falls in. 


We analyzed the continuous relationship between students’ mean 
self-report rating along each dimension and their iAFM learning
rate estimates. In a linear regression predicting iAFM Student
Slopes, Competitiveness and Effort were not significant predictors
but Diligence (B=0.016, p=0.007) was. In a similar linear
regression predicting iAFM Student Intercepts, again Diligence
was the only significant predictor (B=0.02, p=0.04). Thus, among 
self-reported measures, the strongest dimension predicting both 
student ability/prior knowledge and student learning rate was the 
Diligence measure. Future work using causal modeling is 
warranted to discover the true nature of causality among these 
student-level constructs. 


Finally, we investigated the degree of variability in estimated
learning rate across students. The first quartile of student learning
rates from iAFM is 0.03 logits and the third quartile of rates from
iAFM is 0.08 logits. These can be conceptualized as canonical
“slow” and “fast” learners. If we were to assume starting at
around 70% performance (which comes from the model’s global
intercept estimate), it would take the “slow” (0.03 logits) student
approximately 25 opportunities to reach mastery (defined as 85%,
the performance equivalent of a p(Know)=0.95, factoring in the
guess and slip probabilities we used in the actual tutor). It would
take the “fast” (0.08 logits) student approximately 11
opportunities to reach the same place.
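
The arithmetic behind such estimates can be sketched under a simplifying assumption: the log-odds of success grow linearly with practice at the stated per-opportunity rate, and no other model terms contribute. The function names are hypothetical. Under this reading the 0.08-logit rate gives roughly 11 opportunities, in line with the text. The slow-learner figure is sensitive to rounding of the reported rate (a rate nearer 0.035 logits would give about 25), so the sketch should not be expected to reproduce it exactly.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability."""
    return math.log(p / (1 - p))


def opportunities_to_reach(start_prob, target_prob, rate_per_opportunity):
    """Opportunities needed if log-odds rise linearly by the given rate."""
    return (logit(target_prob) - logit(start_prob)) / rate_per_opportunity


fast = opportunities_to_reach(0.70, 0.85, 0.08)  # third-quartile rate
slow = opportunities_to_reach(0.70, 0.85, 0.03)  # first-quartile rate
```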


4.4.3 Identifying wheel spinners 

The current definition of “wheel spinning” put forth in the 
Educational Data Mining community is the “phenomenon in 
which a student has spent a considerable amount of time 
practicing a skill, yet displays little or no progress towards 
mastery” [5]. There has been some controversy around the ideal 
way to measure mastery (e.g., 3 corrects in a row vs. reaching a 
certain p(Know) in knowledge tracing). Furthermore, some 
students may be classified as wheel spinners based on not 
mastering in a certain number of opportunities but they may still 
be making progress. 


We propose that reliable and validated estimates of individual
student learning rate parameters, combined with KC learning rate
parameters, could be used to identify wheel spinning student/KC
pairs in a way that is agnostic to mastery status. Specifically, if the
combined student and KC learning rate parameters in iAFM
predict no improvement or negative improvement across
additional practice opportunities, and the student is not already at
a high level of performance on the first opportunity (here we
considered this to be 80% or above), we could consider the student
to be wheel spinning on the KC. This method of estimating wheel
spinning would be particularly useful for datasets with sparse data
on some student-KC pairs, as it is not performance-dependent after
the model has been fit to the full dataset.
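
The proposed rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes that first-opportunity log-odds are the sum of a student intercept and a KC intercept, and that the combined learning rate is the sum of a student slope and a KC slope. The class and field names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PairParams:
    """Hypothetical fitted parameters for one student/KC pair (logit scale)."""
    student_intercept: float
    kc_intercept: float
    student_slope: float
    kc_slope: float


def is_wheel_spinning(p: PairParams, high_start: float = 0.80) -> bool:
    """Flag pairs with a non-positive combined slope and a low starting point."""
    combined_slope = p.student_slope + p.kc_slope
    start_logit = p.student_intercept + p.kc_intercept
    first_opportunity_prob = 1.0 / (1.0 + math.exp(-start_logit))
    return combined_slope <= 0 and first_opportunity_prob < high_start


# Toy pairs: only the first has both a flat trajectory and a low start.
stuck = PairParams(0.0, 0.2, -0.05, 0.03)
already_high = PairParams(1.5, 1.0, -0.05, 0.03)
improving = PairParams(0.0, 0.2, 0.05, 0.03)
```

Because the rule reads only fitted parameters, it can be applied to student/KC pairs with few observations once the model has been fit to the full dataset.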


Based on this operationalized definition, we found that 
approximately 15% of student-KC pairs in the Chapter 4 dataset 
are estimated to be wheel spinning. That is, those students are not 
making progress on those KCs. This is a substantially lower 
estimate than the 25% reported by a recent wheel spinning 
detector in [5]. An interesting route for future work would be to 
do a direct comparison of the wheel spinning detector presented in 
[5] and our proposed student/KC learning rate identifier within the 
same dataset. This would allow for testing the possibility that 
some students who are still making progress, albeit extremely 
slowly, may be prematurely labeled as “wheel spinners” by [5]. 


5. SUMMARY & LIMITATIONS


Previous efforts towards more explanatory, interpretable, and 
actionable modeling advancements in the realm of 
skill/knowledge component model discovery have been promising 
in their potential and demonstrated impact on learning science and 
education. The present paper represents a novel effort to bring 
these deeper modeling approaches, focused on ensuring
explanatory power, to the realm of individualized student- 
parameter models. 


Towards improving the reliability and validity of individualized 
student estimates, we collected two datasets from the same student 
population. Both datasets were “deep” along the dimension of 
student-KC observations. We fit iAFM and iBKT to both datasets
and showed that the models outranked their non-individualized 
counterparts in terms of fit to data and predictive accuracy. 
Importantly, we moved beyond these metrics to show that 
estimates of student ability were highly reliable (iAFM and iBKT
yielded strongly correlated estimates) and valid (estimates 
significantly predicted pretest data). 


This demonstration of confidence in the student ability estimates 
from iBKT, but even more so iAFM, has promising implications 
for the possibility of individualizing the student models that 
determine mastery in intelligent tutoring systems, at least in terms
of overall student ability/knowledge. Our results also suggest that 
it would be reasonable to fix such student ability parameters, or 
set priors on them, based on either well-mapped pretest 
assessment data or prior (deep) data from those students’ learning. 


We also showed that estimates of student learning rate per 
practice opportunity were reliable and valid in one of the two 
datasets (Chapter 4). This is the first evidence, to our knowledge, 
of obtaining both reliable and valid student learning rates through 
a statistical model with individualized student parameters. We 
believe that this success is largely related to the amount and 
quality of per-student data we collected. 


With the confidence of having reliable and valid parameter 
estimates, we then proceeded to further investigate potential 
explanations for differences in student learning rates within the
Chapter 4 dataset. We found a strong and significant relationship
between student ability and improvement rate as well as an 
additional effect of diligence, based on self-report measures. 
Further research is warranted to distill the causal relationships 
between these constructs. 


Knowing that a model’s estimates of individualized student 
parameters not only fit data well, but are reliable and valid, 
provides greater confidence for applying the model to (1) interpret 
the parameter estimates to understand characteristics of students, 
and (2) use the model to individualize the trajectory of mastery 
estimation for future students. 


Even though both iBKT and iAFM outperformed their non-
individualized counterparts in predicting performance in the 
Chapter 3 dataset, we did not find strong evidence of reliability 
and validity of the student-specific parameter estimates. Thus, we 
did not rely on that dataset to help us understand individual 
differences in learning rates. For the same reason, we could not
confidently attribute the differences in estimated student learning
rates across the datasets to true differences in students’ learning
rates for the two chapters’ material.


Although considering reliability and validity of models’ parameter 
estimates sets a higher bar than predictive accuracy for evaluating 
modeling advances, we believe those to be important 
characteristics of a model that is to be explanatory, interpretable, 
and/or actionable. Here, we have demonstrated that with a 
sufficiently good dataset, iAFM and iBKT are individualized
student models that can produce reliable and valid parameter 
estimates. 


Since our present work was limited to two datasets on one
population of students, it is unclear to what degree our
modeling results will generalize, especially given that at least
iAFM does not produce reliable, valid parameter estimates on
more sparse datasets [9]. In addition, these results are limited to
two specific statistical models that produce individualized estimates
of student-level parameters, with a particular focus on individual
differences in learning rate. There are other classes of models that 
could be extended to estimate differences in learning rate: for 
example, producing individualized estimates of the differential 
effects of success versus failure [15]. This would be an interesting 
focus for future work on this topic. 


Nevertheless, we have laid a foundation of methodology by which 
reliability and validity of parameter estimates, whether student- or 
KC-level, can be assessed. We have also demonstrated ways of 
using the reliable and valid student parameter estimates from 
iAFM and iBKT to yield interesting insights about student 
learning.


ABSTRACT


In this paper, we investigate two purported problems with
Bayesian Knowledge Tracing (BKT), a popular statistical
model of student learning: identifiability and semantic model
degeneracy. In 2007, Beck and Chang stated that BKT is
susceptible to an identifiability problem: various models with
different parameters can give rise to the same predictions
about student performance. We show that the problem they
pointed out was not an identifiability problem, and using an
existing result on the identifiability of hidden Markov models,
we show that under mild conditions on the parameters,
BKT is actually identifiable. In the second part of the paper,
we discuss a problem that has been conflated with identifiability,
but which actually does arise when fitting BKT models:
semantic model degeneracy, where the model parameters that best
fit the data are inconsistent with the conceptual assumptions
underlying BKT. We give some intuition for why semantic
model degeneracy may arise by showing that BKT models fit
to data generated from alternative models of student learning
can have semantically degenerate parameters. Finally, we
discuss the potential implications of these insights.


Keywords 
Bayesian Knowledge Tracing, identifiability, semantic model 
degeneracy 


1. INTRODUCTION 


Bayesian Knowledge Tracing (BKT) is a popular model of
student learning that tries to predict the probability that
a student knows a skill and the probability that a student
will correctly answer questions based on the skill. The BKT
model is a two-state hidden Markov model (HMM) that
posits students have either mastered a skill or not, and at
every practice opportunity, a student who has not mastered
the skill has some chance of attaining mastery. If a student
has mastered a skill, they will answer a question correctly
unless they “slip” with some (ideally small) probability, and
if the student has not mastered the skill, they can only guess
correctly with some (ideally small) probability. In 2007,
Beck and Chang stated that BKT is not identifiable, meaning
that different settings of the four BKT parameters can
lead to identical predictions about a student’s performance
[7]. Whether or not BKT is identifiable is an important
issue, because if BKT is not identifiable, it means that we
would fundamentally need other criteria (beyond accurately
modeling student performance data) to fit BKT models.


However, in this paper, we show that BKT is actually an
identifiable model, under mild conditions on the parameters
that should always be satisfied in practical settings. This
result follows from BKT being a special case of a hidden
Markov model and therefore it inherits identifiability results
that prior work has proven for HMMs. This implies no
additional criteria beyond predictive accuracy are needed to
identify a single BKT model that best explains observed
student performance, under the assumption that learning
can accurately be modeled by a BKT model. We then describe three
potential issues with BKT models that may have been
misconstrued as an identifiability problem in the literature. Note
that our goal is by no means to criticize prior researchers, as
such researchers helped identify some important limitations
of Bayesian Knowledge Tracing, but these limitations do not
stem from a lack of identifiability.


In the second part of this paper, we focus on one of the 
issues that has been conflated with identifiability, but which 
actually does arise when fitting BKT models, semantic model 
degeneracy—the model parameters that best fit the data are 
inconsistent with the conceptual assumptions underlying 
BKT. We give a critical look at the types of semantic model 
degeneracy in the literature and then give some intuition for 
why this problem may arise by showing that BKT models 
fit to data generated from alternative models of student 
learning can have degenerate parameters. We further show 
that fitting models to sequences of different lengths generated 
from the same underlying model can result in different forms 
of semantic degeneracy. We show that these insights can 
have important implications on how these models should be 
used.


2. BAYESIAN KNOWLEDGE TRACING

The Bayesian Knowledge Tracing model is a two-state hidden
Markov model that keeps track of the probability that a
student has mastered a particular skill and the probability
that the student will be able to answer a question on that
skill correctly over time. At each practice opportunity $i \geq 1$
(i.e., when a student has to answer a question corresponding
to the skill), the student has a latent knowledge state
$K_i \in \{0, 1\}$. If the knowledge state is 0, the student has
not mastered the skill, and if it is 1, then the student has
mastered it. The student’s answer can either be correct or
incorrect: $C_i \in \{0, 1\}$ (where 0 corresponds to incorrect and
1 corresponds to correct). After each practice opportunity,
the student is assumed to master the skill with some probability.
The BKT model is parametrized by the following four
parameters:


• $P(L_0) = P(K_1 = 1)$: the initial probability of knowing
  the skill (before the student is given any practice
  opportunities)


• $P(T) = P(K_{i+1} = 1 \mid K_i = 0)$: the probability of mastering
  a skill at each practice opportunity (if the student
  has not yet mastered the skill)


• $P(G) = P(C_i = 1 \mid K_i = 0)$: the probability of guessing


• $P(S) = P(C_i = 0 \mid K_i = 1)$: the probability of “slipping”
  (answering incorrectly despite having mastered
  the skill)
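
The generative process defined by these four parameters can be sketched as a short sampler. The function and argument names are illustrative assumptions; the sketch follows the standard BKT assumption stated above that mastery, once attained, is not lost.

```python
import random


def simulate_bkt(p_l0, p_t, p_g, p_s, n_opportunities, rng):
    """Sample a list of 0/1 answers from the two-state BKT process."""
    mastered = rng.random() < p_l0  # latent state K_1
    answers = []
    for _ in range(n_opportunities):
        # Emit an answer: slip if mastered, guess otherwise.
        p_correct = (1 - p_s) if mastered else p_g
        answers.append(1 if rng.random() < p_correct else 0)
        # Transition: an unmastered skill is learned with probability P(T).
        if not mastered:
            mastered = rng.random() < p_t
    return answers


rng = random.Random(0)
sample = simulate_bkt(0.36, 0.1, 0.3, 0.05, 10, rng)
```

Here `sample` is one simulated student's sequence of ten answers; repeated calls with the same `rng` give further independent sequences.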


3. IDENTIFIABILITY 

In their 2007 paper, Beck and Chang claimed that BKT is 
not identifiable, illustrating this with a particular example of 
three different BKT models [7]. For concreteness we include 
these models in Table 1. The authors consider the case of 
predicting the probability of correctness under these three 
models as the students receive practice opportunities, but in
the absence of any observation about the student’s performance.
They use plots as in Figure 1 to claim that the three models 
make very different predictions about student knowledge 
(Figure 1 (a)), but make identical predictions about student 
performance (Figure 1 (b)). They claim, 


All three of the sets of parameters instantiate
a knowledge tracing model that fit the observed
data equally well; statistically there is no justification
for preferring one model over another. This
problem of multiple (differing) sets of parameter
values that make identical predictions is known
as identifiability.


However, this is not correct since no data was used to fit
these curves; the curves are predicting the probability that
a student will know the skill or will answer the skill correctly
at each practice opportunity $i$, when we have no prior
performance or data on the student. In order to take past
data from a student into account, we actually want to predict
$P(K_i = 1 \mid C_1, \ldots, C_{i-1})$ and $P(C_i = 1 \mid C_1, \ldots, C_{i-1})$, and
this is indeed what we do in practice when doing knowledge
tracing; we make predictions based on our past observations.
Figure 2 shows the curves predicting these conditional probabilities
for a particular sequence of correct/incorrect answers
for a student (namely, we use (1,0,0,0,0,0,0,1,1)). We find
that even when we condition on a single observation (i.e.,
for $P(C_2 = 1 \mid C_1)$), the three models make vastly different
predictions, and as we collect more data, the models continue
to make very different predictions. In fact, except for
$P(C_1 = 1)$, the models never agree on the probability that a
student would answer the step correctly.
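
The contrast between unconditional and conditional predictions can be reproduced with a short sketch using the Table 1 parameter values. The function names are illustrative, and the update is the standard BKT posterior-then-transition step. With the rounded Table 1 parameters, the unconditional curves agree to within a few thousandths rather than exactly.

```python
MODELS = {  # (P(L0), P(T), P(G), P(S)) as listed in Table 1
    "Knowledge": (0.56, 0.1, 0.00, 0.05),
    "Guess": (0.36, 0.1, 0.30, 0.05),
    "Reading Tutor": (0.01, 0.1, 0.53, 0.05),
}
ANSWERS = (1, 0, 0, 0, 0, 0, 0, 1, 1)  # sequence used in the text


def predict_correct(p_know, p_g, p_s):
    """P(correct) given the current probability of mastery."""
    return p_know * (1 - p_s) + (1 - p_know) * p_g


def marginal_curve(params, n):
    """P(C_i = 1) for i = 1..n, with no observations."""
    p_l0, p_t, p_g, p_s = params
    p_know, curve = p_l0, []
    for _ in range(n):
        curve.append(predict_correct(p_know, p_g, p_s))
        p_know = p_know + (1 - p_know) * p_t
    return curve


def conditional_curve(params, answers):
    """P(C_i = 1 | C_1, ..., C_{i-1}) for each i, given observed answers."""
    p_l0, p_t, p_g, p_s = params
    p_know, curve = p_l0, []
    for c in answers:
        p_c = predict_correct(p_know, p_g, p_s)
        curve.append(p_c)
        # Bayes update on the observed answer, then the learning transition.
        if c == 1:
            posterior = p_know * (1 - p_s) / p_c
        else:
            posterior = p_know * p_s / (1 - p_c)
        p_know = posterior + (1 - posterior) * p_t
    return curve


marginals = {name: marginal_curve(p, len(ANSWERS)) for name, p in MODELS.items()}
conditionals = {name: conditional_curve(p, ANSWERS) for name, p in MODELS.items()}
```

After a single correct answer, the Knowledge model predicts a 0.95 chance of a correct second answer, while the Reading Tutor model predicts roughly 0.58, even though their unconditional curves nearly coincide.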


Formally, a model is said to be identifiable if there are no two
distinct sets of model parameters $\theta$ and $\theta'$ that can give rise
to the same joint probability distribution over observations
under that model. As far as inference is concerned, identifiability
means that the likelihood function of the model has
only one global maximum, so inference of the true model
parameters is possible. In the case of BKT, the model would
be identifiable if for any two distinct sets of BKT parameters,
$\theta$ and $\theta'$,

$$P_\theta(C_1, \ldots, C_n) \neq P_{\theta'}(C_1, \ldots, C_n)$$

for some $n \geq 1$. What Beck and Chang show is that there can
be infinitely many models that share the same set of marginal
distributions $P(C_1), P(C_2), \ldots, P(C_n)$. This does not mean
the model is unidentifiable. As we saw from Figure 2, the
conditional distribution $P(C_n \mid C_1, \ldots, C_{n-1})$ is quite different
for each model, and so the joint distribution $P(C_1, \ldots, C_n)$
is also very different for the three models.
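
To make the distinction concrete, the joint distribution over short answer sequences can be enumerated with the HMM forward recursion. This is a minimal sketch with illustrative function names; the two parameter sets are the Knowledge and Guess models from Table 1.

```python
from itertools import product


def sequence_probability(params, answers):
    """P(C_1, ..., C_n = answers) via the forward recursion for BKT."""
    p_l0, p_t, p_g, p_s = params
    # alpha[k] tracks P(observations so far, K_i = k) for k in {0, 1}.
    alpha = [1 - p_l0, p_l0]
    for i, c in enumerate(answers):
        if i > 0:
            # Unmastered skills are learned with probability P(T);
            # mastered skills stay mastered (no forgetting).
            alpha = [alpha[0] * (1 - p_t), alpha[0] * p_t + alpha[1]]
        emit = [p_g if c else 1 - p_g, (1 - p_s) if c else p_s]
        alpha = [alpha[0] * emit[0], alpha[1] * emit[1]]
    return sum(alpha)


def joint_distribution(params, n):
    """Probability of every length-n answer sequence."""
    return {
        seq: sequence_probability(params, seq)
        for seq in product((0, 1), repeat=n)
    }


knowledge = joint_distribution((0.56, 0.1, 0.0, 0.05), 3)
guess = joint_distribution((0.36, 0.1, 0.3, 0.05), 3)
largest_gap = max(abs(knowledge[s] - guess[s]) for s in knowledge)
```

Although the two models have nearly the same marginal probability of a correct answer at each opportunity, `largest_gap` shows that they assign clearly different probabilities to at least one three-answer sequence.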


It turns out there has been a substantial amount of work, 
going back 50 years and continuing to this day, on finding the 
conditions for which hidden Markov models are identifiable 
[15, 1, 2, 17, 10]. Although much of the literature focuses on 
particular types of HMMs (e.g., stationary, irreducible) that 
do not include the standard BKT model, Anandkumar et al. 
have recently shown that, subject to some non-degeneracy 
conditions, a large class of HMMs, which includes BKTs, is 
identifiable with just the joint probability distributions for 
up to three sequential observations [4]. That is, knowing 
P(C1), P(C1, C2), and P(C1, C2, C3) is enough to infer the 
unique model parameters, subject to non-degeneracy conditions.
In our context, the conditions are that P(L0) ∉ {0, 1},
P(T) ≠ 1, and P(G) ≠ 1 − P(S). This suggests that as long
as we have more than two observations per student, BKT
models with reasonable parameters are identifiable and there
is a single global maximum to the likelihood function. Feng
recently independently showed the same result directly for
BKT models, except without requiring the condition that
P(L0) ≠ 0 [9]. One advantage of relying on general identifiability
results for HMMs is that we can use the same results
to show the conditions under which related student models
that can also be modeled as HMMs are identifiable.¹
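
As a small illustration, the three non-degeneracy conditions stated above can be written as a simple check. This is a sketch under our own naming (the function and the tolerance argument are hypothetical); it only tests the conditions as stated in the text and does not itself prove identifiability.

```python
def bkt_is_nondegenerate(p_l0, p_t, p_g, p_s, tol=1e-12):
    """Check the conditions P(L0) not in {0, 1}, P(T) != 1, P(G) != 1 - P(S).

    `tol` is an assumed numerical tolerance for comparing floats.
    """
    initial_ok = tol < p_l0 < 1 - tol          # P(L0) is neither 0 nor 1
    transition_ok = abs(p_t - 1.0) > tol       # P(T) != 1
    emission_ok = abs(p_g - (1.0 - p_s)) > tol  # P(G) != 1 - P(S)
    return initial_ok and transition_ok and emission_ok


if __name__ == "__main__":
    # The "Knowledge" parameters from Table 1 satisfy all three conditions.
    print(bkt_is_nondegenerate(0.56, 0.1, 0.0, 0.05))
    # Here P(G) = 1 - P(S): both states emit identically, so the check fails.
    print(bkt_is_nondegenerate(0.5, 0.1, 0.5, 0.5))
```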


This misuse of the term “identifiability” has led to multiple
subsequent papers in the educational data mining community
over the past decade that have similarly given a
mistaken description of the underlying phenomena [5, 16, 13,
12]. Two papers, however, have correctly identified that the


¹ For example, for the BKT model with forgetting, where
P(F) = P(K_{i+1} = 0 | K_i = 1) ≠ 0, we can show that the
model is identifiable with the same conditions, except that we
require P(T) ≠ 1 − P(F) instead of P(T) ≠ 1. We can also
easily show the conditions under which multi-state extensions
of BKT such as the model introduced in Section 4.2 are
identifiable. These conditions can be derived from Condition
3.1 and Proposition 4.2 of [4]. See also the note under
Proposition 3.4 of [3].
             Model
Parameter    Knowledge    Guess    Reading Tutor
P(L0)        0.56         0.36     0.01
P(T)         0.1          0.1      0.1
P(G)         0            0.3      0.53
P(S)         0.05         0.05     0.05

Table 1: The three BKT models used by Beck and Chang [7] to claim BKT is unidentifiable. The models are chosen to have
very different semantic interpretations. The Knowledge model requires the student to master the skill to get it correct, the
Guess model relies on the student guessing, and the Reading Tutor model has an even higher probability of guessing, but it was
based on models actually used by the Reading Tutor [14].


[Figure 1 image: two panels, (a) Learning Curve and (b) Performance Curve, each plotted against Practice Opportunity (1 to 10) for the Knowledge, Guess, and Reading Tutor models.]
Figure 1: Hypothetical learning and performance curves for three models from [7], in absence of any data. 
[Figure 2 image: two panels, (a) Learning Curve and (b) Performance Curve, each plotted against Practice Opportunity for the Knowledge, Guess, and Reading Tutor models.]

Figure 2: Learning and performance curves for three models from [7] conditioned on all past observations for a student whose
observed trajectory is as follows: (C1, C2, C3, C4, C5, C6, C7, C8, C9) = (1, 0, 0, 0, 0, 0, 0, 1, 1).
“identifiability problem” is limited to the case where there
is no data [18, 11]. Even though this is not a statistically 
precise claim, it does show that some researchers have the cor- 
rect understanding behind the phenomenon. Van de Sande 
distinguishes between the two cases where predictions are 
made in the absence of data and where they are made in the 
presence of data, and claims that the source of the identifia- 
bility problem in the former case is that the predictions can 
be completely determined by three parameters, so there is 
a degree of freedom [18]. When we are making predictions in the
presence of data, however, he claims there is no identifiability problem, because
P(K_i | C_i) depends on four parameters [18]. While he has
correctly identified the absence of an identifiability problem 
in the presence of data, we believe that there is still confu- 
sion about the identifiability problem in the community (e.g., 
some of the papers that show a misunderstanding of the issue 
are more recent than [18]). We hope to make the absence 
of an identifiability problem more clear and elucidate the 
phenomena and misconceptions surrounding it. Gweon et 
al. also distinguish between two cases which they refer to as 
the BKT model without measurement and the BKT model 
with measurement, and show, as van de Sande did, that the 
former depends on three parameters (hence the “identifia- 
bility problem”) whereas the latter depends on all four [11]. 
However, they claim this does not necessarily mean that 
the BKT model with measurement does not suffer from an 
identifiability problem, and actually claim that it still does 
suffer from an identifiability problem, because empirically, 
they found that for some data, fitting BKT models many 
times resulted in a wide spread of possible parameters [11]. 
However, this cannot be due to the presence of multiple
global maxima, which we have shown cannot exist, and hence 
must be due to multiple local optima. 


The work closest to ours is Feng’s recently published disser- 
tation [9]. The author gives a similar explanation to ours for 
why Beck and Chang’s claim was incorrect and also proves 
that the BKT model is identifiable directly [9]. However, 
we believe the exposition there is perhaps less accessible to 
the educational data mining community and will likely not 
obtain the visibility needed to clear the misunderstandings 
surrounding the identifiability of BKT. In this paper, we
focus not only on identifying the misidentified identifiability
problem, but also on understanding the confusion surrounding
it and on pointing out actual issues with fitting BKT
models that have been conflated with identifiability. This is
the focus of the rest of the paper.


There are three potential sources of confusion that we believe 
could be and have been misconstrued as an identifiability 
problem: 


1. A priori predictions. Multiple models, which
make very different claims about students’ knowledge
state over time, could predict the same probability
that students answer questions correctly over time in
the absence of data. This is the problem that Beck
and Chang conflated with identifiability, and that many
researchers thereafter also treated as identifiability. As
we showed above, van de Sande, Gweon et al., and
Feng correctly identified what is happening here [18,
11, 9].
2. Multiple local optima. It is well known that the
expectation-maximization algorithm that is commonly
used to fit BKT models is susceptible to converging
to local optima of the likelihood function rather than
converging to the global optimum. While Beck and
Chang clearly did not conflate this with the identifiability
issue, we saw that other researchers such as
Gweon et al. have possibly conflated the two. In order
to avoid local optima, one can use a grid search over
the entire parameter space or run the
expectation-maximization algorithm multiple times with different
initializations of the parameters.


3. Semantic model degeneracy. Baker et al. identified an- 
other problem with BKT models, which they termed 
model degeneracy [5]. A model is said to be semantically
degenerate² when it is inconsistent with the
conceptual assumptions underlying the BKT model.
The problem arises when the model that best fits our data
is semantically degenerate. Even though Baker et al.
clearly contrasted this to the (supposed) identifiability 
problem, we claim that this is the problem that Beck 
and Chang attempted to fix in their paper. We will 
now focus on better understanding this problem. 
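
The grid search mentioned in item 2 can be sketched as follows. This is a minimal illustration under our own assumptions, not the BKT Brute Force code or any author's implementation: the function names and the default step size are hypothetical, and a coarse grid is used so that the example runs quickly. Because every grid point is evaluated, the result does not depend on an initialization and so cannot get stuck in a local optimum of the likelihood (up to the grid resolution).

```python
import itertools
import math


def bkt_log_likelihood(sequences, p_l0, p_t, p_g, p_s):
    """Log-likelihood of 0/1 response sequences under a BKT model."""
    total = 0.0
    for obs in sequences:
        p_known = p_l0  # P(K_i = 1 | C_1, ..., C_{i-1})
        for c in obs:
            p_correct = p_known * (1 - p_s) + (1 - p_known) * p_g
            p_obs = p_correct if c == 1 else 1.0 - p_correct
            if p_obs <= 0.0:
                return float("-inf")  # data impossible under these parameters
            total += math.log(p_obs)
            likelihood_known = (1 - p_s) if c == 1 else p_s
            posterior = min(1.0, p_known * likelihood_known / p_obs)
            p_known = posterior + (1 - posterior) * p_t
    return total


def grid_search(sequences, step=0.1):
    """Evaluate every (P(L0), P(T), P(G), P(S)) on a regular grid in [0, 1]."""
    n_points = int(round(1.0 / step)) + 1
    grid = [round(k * step, 10) for k in range(n_points)]
    best_params, best_ll = None, float("-inf")
    for params in itertools.product(grid, repeat=4):
        ll = bkt_log_likelihood(sequences, *params)
        if best_params is None or ll > best_ll:
            best_params, best_ll = params, ll
    return best_params, best_ll


if __name__ == "__main__":
    toy_data = [[0, 0, 1, 1, 1], [0, 1, 1, 1, 1], [0, 0, 0, 1, 1]]
    print(grid_search(toy_data, step=0.25))
```

The cost grows with the fourth power of the number of grid points per parameter, which is why a finer grid (such as 0.01 increments) is considerably more expensive than this toy setting.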


4. SEMANTIC MODEL DEGENERACY 


In their paper, Beck and Chang propose a way to get around 
the identifiability problem. They propose using Dirichlet pri- 
ors to encode prior beliefs about the BKT parameters, which 
will in turn bias the model search towards more reasonable 
parameters [7]. They motivate their method as follows: 


We have more knowledge about student learning 
than the data we use to train our models. As 
cognitive scientists, we have some notion of what 
learning “looks like.” For example, if a model 
suggest that a skill gets worse with practice, it 
is likely the problem is with the modeling ap- 
proach, not that the students are actually getting 
less knowledgeable. The question is how can we
encode these prior beliefs about learning? 


The problem they appear to be describing is that some models 
have parameters that do not match our intuitions of student 
learning, i.e., they are exactly describing the issue of semantic 
model degeneracy (and not that of unidentifiability). Baker 
et al. later provide another solution to tackling semantic 
model degeneracy by using contextual features to estimate 
the guess and slip parameters [5]; however, interestingly they 
did not view Beck and Chang’s original solution as a way of 
tackling semantic model degeneracy, treating it as a way to 
tackle identifiability as the authors originally claimed. 


Having shown that identifiability is not an issue with BKT, 
and given that there are easy ways to tackle the existence 
of local optima, we believe semantic model degeneracy is 
perhaps the most important problem with respect to fitting 
BKT models that needs to be better understood and tackled. 
Essentially, the problem arises because the BKT is simply a 


² We refer to this property as semantic model degeneracy to
distinguish it from mathematically degenerate parameters 
that would result in BKT models being unidentifiable, as 
described above. 
particular form of a two-state hidden Markov model and it
will try to fit the best two-state hidden Markov model it can
to the data; our model fitting procedures do not understand
that the K_i = 1 state is supposed to correspond to mastering
a skill, and so it might fit a model that does not match our 
intuitions of mastery. We will try to understand this in more 
detail below, but first we aim to characterize the types of 
semantic model degeneracy that have been pointed out in 
the literature. 


4.1 Types of Semantic Model Degeneracy 
Baker et al. distinguish between two forms of semantic model 
degeneracy: theoretical degeneracy and empirical degeneracy 
[5]. They define a model to be theoretically degenerate when
either the guess or the slip parameter is greater than 0.5. 
They define a model to be empirically degenerate if one of 
two things occur: (1) for some large enough n the model’s 
estimate of the student having mastered the skill decreases 
after the student gets the first n skills correct or (2) for some 
large enough m, the student does not achieve mastery (our 
estimate of the student having mastered the skill does not go 
beyond 0.95) even after m consecutive correct responses [5]. 
The authors arbitrarily chose the values n = 3 and m = 10. 
Note that the first form of empirical degeneracy is only 
possible if 1 — P(S) < P(G) (i.e., the student is more likely 
to answer a question correctly if they have not mastered a skill 
than if they have mastered a skill), as was shown by van de 
Sande [18]. This is true, even for n = 1. Thus, this first notion 
of empirical degeneracy is equivalent to P(G) + P(S) > 1, 
which implies either P(S) > 0.5 or P(G) > 0.5, meaning 
that it always implies theoretical degeneracy! Huang et al. 
have noted that while P(G) + P(S) > 1 definitely implies 
semantically degenerate parameters as it contradicts mastery, 
the condition that P(G) < 0.5 and P(S) < 0.5 may not 
always be necessary for the parameters to be semantically 
meaningful, since, for example, there may be some domains 
where the student can guess the correct answer easily [12]. 
We agree that suggesting P(G) > 0.5 is degenerate does
seem somewhat arbitrary depending on the domain; however,
we do think P(S) > 0.5 should be characterized as a form
of semantic degeneracy, because, as Baker et al. claimed, it does
not make sense for a student who has mastered a skill to 
answer questions of that skill incorrectly most of the time— 
that goes against our intuitions of what mastery means. 
In any case, it does not seem like the distinction between 
theoretical and empirical degeneracy is a clear one, so we 
suggest categorizing the forms of semantic model degeneracy 
by what they suggest about student learning: 


• Forgetting: This is a result of P(G) + P(S) > 1, which
suggests that not only are students not learning, but 
that students have some probability of losing their 
knowledge over time. Another way to view this degen- 
eracy is that the state we would conceptually call the 
mastery state is now the state where performance is 
worse. 


• Low Performance Mastery: This is a result of P(S) >
0.5. Alternatively, we can set our threshold for low 
performance mastery to be lower (e.g., P(S) > 0.4). 


• High Performance Guessing: This is a result of P(G) >
t, where t is some threshold. As mentioned earlier,
this seems like a weak form of degeneracy, as students
can often guess an answer easily even if they have not
mastered a skill, but we can set t to a large enough
value to make this a form of model degeneracy.


• High Performance ⇏ Learning: This is the second form
of empirical degeneracy given by Baker et al. [5]: for
some choice of m, the probability that the student
has achieved mastery is less than some threshold p
(typically taken to be 0.95) after m consecutive correct
responses.
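
The first three categories depend only on the guess and slip parameters, so they can be expressed as a small check. The sketch below is our own illustration (the function name and default thresholds are assumptions; the text leaves the thresholds to the modeler). The fourth category is omitted here because it requires running inference over a sequence of correct responses rather than inspecting parameters directly.

```python
def degeneracy_flags(p_g, p_s, slip_threshold=0.5, guess_threshold=0.5):
    """Flag the parameter-based forms of semantic model degeneracy.

    The default thresholds of 0.5 are an assumption for illustration.
    """
    return {
        "forgetting": p_g + p_s > 1.0,
        "low_performance_mastery": p_s > slip_threshold,
        "high_performance_guessing": p_g > guess_threshold,
    }


if __name__ == "__main__":
    # Guess and slip both well below 0.5: no flag is raised.
    print(degeneracy_flags(0.2, 0.1))
    # A slip of 0.44 is flagged only under a stricter slip threshold of 0.4.
    print(degeneracy_flags(0.27, 0.44, slip_threshold=0.4))
```

Note that whenever the forgetting flag is raised, at least one of P(G) and P(S) exceeds 0.5, so with the default thresholds at least one of the other two flags is raised as well.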


4.2 Sources of Semantic Model Degeneracy 
We will now consider a possible explanation for why BKT 
models are so prone to semantic model degeneracy (which 
we believe to be part of the reason that researchers look 
towards identifiability and local optima to explain the strange 
parameters that result from fitting BKT models). First of all, 
note that forgetting degeneracy will occur whenever students 
actually do forget or when they learn misconceptions; it is 
not unreasonable to believe that students will sometimes 
learn and reinforce a misconception, causing their knowledge 
of some skill to decrease over time. Thus, while this form 
of degeneracy technically violates our notion of mastery, it 
is to be expected if we switch the semantic interpretation 
of the two states and suppose that students forget instead 
of learn. We now consider sources of the other forms of 
semantic model degeneracy. We claim that such forms of
semantic model degeneracy can result from being unable
to accurately capture the complexity of student learning with
a two-state HMM. When this is the case, the fitting procedure
simply finds the two-state HMM that best fits the data, rather
than a model that both fits the data accurately and
matches our intuitions about what it means for a student
to have mastered a skill.


To support our claim, suppose student learning is actually 
governed by a 10-state HMM with ten consecutive states 
representing different levels of mastery. From each state, the 
student has some probability of transitioning to the next 
state (slightly increasing in mastery), and from each state, 
the student has a probability of answering questions correctly, 
and this probability strictly increases as the student’s level of 
mastery increases. Specifically, consider the model presented
in Table 2. Now suppose we try to use a standard BKT
model to fit data generated from this alternative model of
student learning. The first two columns of Table 3 show the
parameters of BKT models fit to 500 sequences of 20 practice
opportunities or 100 sequences of 200 practice opportunities,
both generated from the model in Table 2. Notice that
the model fits (nearly) degenerate parameters in both cases. 
When we only have 20 observations per student, the model 
estimates a very high slip parameter; this is because it has 
to somehow aggregate the different latent states which cor- 
respond to different levels of mastery, and since not many 
students would have reached the highest levels of mastery 
in 20 steps, it is going to predict that students who have 
“mastered” the skill are often getting it wrong. However, 
what’s more interesting is that for the same model, if we 
simply increase the number of observations per student from 
20 to 200, we find that the slip parameter is reasonably small, 
but now the guess probability is 0.49! This is because, by 
Table 2: Alternative model of student learning where there are ten levels of mastery.


             10-State HMM        AFM
Parameter    20       200        20       200
P(L0)        0.30     0.001      0.09     0.001
P(T)         0.05     0.02       0.05     0.05
P(G)         0.27     0.49       0.14     0.28
P(S)         0.44     0.13       0.46     0.03


Table 3: BKT models fit to data generated from the model
described in Table 2 and an additive factors model described
in the text. The first column for each model is fit to 500 
sequences of 20 practice opportunities, while the second 
column is fit to 100 sequences of 200 practice opportunities. 
The models were fit using brute-force grid search over the 
entire parameter space in 0.01 increments for the parameters 
using the BKT Brute Force model fitting code [6]. 


this point most students have actually reached the highest 
level of mastery, so to compensate for the varying levels of 
mastery that occurred earlier in student trajectories, the 
model will have to estimate a high guess parameter. So we 
find that not only can alternative models of student learning 
lead to fitting (near) degenerate parameters, but varying 
the number of observations can lead to different forms of 
degeneracy! This is a counterintuitive phenomenon that we 
believe is not the result of not having enough data (students) 
to fit the models well, but rather the result of the mismatch 
between the true form of student learning and the model we
are using to fit student learning.


We find similar results if we fit a BKT model to data gener- 
ated from another alternative model of student learning that 
is commonly used in the educational data mining community, 
the additive factors model (AFM) [8]. In particular, we used 
the model 


$$P_\theta(C_i = 1) = \frac{1}{1 + \exp(-\theta + 2 - 0.1 i)}$$


where θ ~ N(0, 1) is the student’s ability.³ The last two
columns of Table 3 show the parameters of BKT models fit
to data generated from this model. We again find that when
using only data with 20 practice opportunities, we fit a high
slip parameter, but when using data with 200 practice
opportunities, we fit a higher guess parameter and a very
small slip parameter.
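
For readers who want to reproduce this kind of experiment, the data-generating step can be sketched as below. This is an illustrative sketch rather than the authors' code: the function names and the seed are our own, and we assume that i counts the number of prior practice opportunities starting at 0, which is the reading consistent with the footnote on this model.

```python
import math
import random


def afm_p_correct(theta, i):
    """P(C_i = 1) = 1 / (1 + exp(-theta + 2 - 0.1 * i)) for ability theta."""
    return 1.0 / (1.0 + math.exp(-theta + 2 - 0.1 * i))


def simulate_afm(n_students, n_opportunities, seed=0):
    """Sample 0/1 response sequences, one per student, with theta ~ N(0, 1)."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    data = []
    for _ in range(n_students):
        theta = rng.gauss(0.0, 1.0)
        data.append([
            1 if rng.random() < afm_p_correct(theta, i) else 0
            for i in range(n_opportunities)
        ])
    return data


if __name__ == "__main__":
    short = simulate_afm(500, 20)    # 500 sequences of 20 opportunities
    long_ = simulate_afm(100, 200)   # 100 sequences of 200 opportunities
    print(len(short), len(short[0]), len(long_), len(long_[0]))
```

The resulting sequences could then be passed to any BKT fitting routine to compare the fitted guess and slip parameters across the two settings.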


Additionally, notice that for the parameters fit to the 10- 
state HMM, the probability of transitioning to mastery is 


³ This model suggests that students who are two standard
deviations above the mean initially will answer correctly half 
the time, and after 20 practice opportunities the average 
student will answer correctly half the time. 
very small when we fit to sequences with 200 practice opportunities.
Since the transition probability is small and the
guess probability is large, we also have high performance ⇏
learning degeneracy for this model for m = 10. That is,


$$P(K_{11} = 1 \mid C_1 = 1, C_2 = 1, \ldots, C_{10} = 1) \approx 0.89 < 0.95$$


This is yet another form of degeneracy that does not exist 
in the model fit to sequences of 20 practice opportunities. 
Furthermore, notice that when we have 200 observations, 
the probability of transitioning to mastery is smaller than
P(K_{i+1} = k+1 | K_i = k) for all states k in the model that
generated the data (Table 2). Again, this is because the best
fitting BKT model will aggregate low performing states and 
high performing states, so a single transition in the BKT 
model between these two aggregate states will have to loosely 
correspond to the student transitioning several times in the 
actual 10-state HMM. Thus, while the learned BKT model
makes it appear as though learning happens very slowly, 
according to the true student model, learning actually occurs 
much more often but in more progressive increments. This 
suggests that if we use some automated technique to detect 
if a skill is useful for student learning, we may conclude it is 
not, if we do not allow for the possibility that students are 
learning progressively. 
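
The mastery estimate after a streak of correct responses can be computed with the standard BKT update. The sketch below is our own illustration (the function name is hypothetical) and uses the rounded parameters reported in Table 3, so the value it produces should be close to, but need not exactly match, the figure reported in the text.

```python
def mastery_after_correct_streak(p_l0, p_t, p_g, p_s, m):
    """P(K_{m+1} = 1 | C_1 = 1, ..., C_m = 1) under a BKT model."""
    p_known = p_l0
    for _ in range(m):
        p_correct = p_known * (1 - p_s) + (1 - p_known) * p_g
        if p_correct <= 0.0:
            raise ValueError("a correct response has probability zero")
        # Condition on the correct response, then apply the learning step.
        posterior = p_known * (1 - p_s) / p_correct
        p_known = posterior + (1 - posterior) * p_t
    return p_known


if __name__ == "__main__":
    # Table 3, 10-state HMM columns: fit to 200 and to 20 opportunities.
    print(mastery_after_correct_streak(0.001, 0.02, 0.49, 0.13, m=10))
    print(mastery_after_correct_streak(0.30, 0.05, 0.27, 0.44, m=10))
```

With the 200-opportunity parameters the estimate stays below the 0.95 mastery threshold after ten consecutive correct responses, whereas with the 20-opportunity parameters it exceeds the threshold.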


These observations have important implications for how 
learned models can be used in practice. Using such a BKT 
model to predict student mastery can lead to problematic in- 
ferences. For example, for the first model in Table 3, the BKT 
model assumes that when a student has reached mastery, 
they have a 56% chance of answering a question correctly, 
whereas a student who has actually mastered the skill will 
have a 90% chance of answering correctly (see Table 2). Thus, 
an intelligent tutoring system that uses such a BKT model 
to determine when a student has had sufficient practice on a 
problem, will likely give far fewer problems to the student 
than they actually need in order to reach mastery! 


There are several potential ways that future work can pro- 
ceed in light of these findings. One is that we should be 
giving our model fitting procedures more domain knowledge 
about the kind of model we want it to fit. This is essentially 
what Beck and Chang did by using Dirichlet priors [7] and 
what Baker et al. did by estimating the guess and slip param- 
eters using context [5]. But perhaps there are other ways of 
doing this where we do not need to give context-dependent 
domain knowledge to the model per se, but rather come up 
with a model that realizes the difference between a student 
having mastered a skill or not (which the BKT model cannot 
do). However, this may not be ideal in some cases where 
student learning cannot accurately be modeled by BKT with
semantically plausible parameters. For example, when we
have forgetting degeneracy, we should probably not force
the parameters to suggest learning is occurring when it may
not be. Another way to proceed is to consider alternative 
student models, which is an active area of educational data 
mining research. Perhaps, obtaining semantically degenerate 
parameters from a fit should signal that our students may 
be learning in more complicated ways than the simple BKT 
model can predict, and so we should try to find alternative 
models that fit our data better without yielding semantically 
degenerate parameters. Finally, even if our model is semantically
degenerate, that does not necessarily make the BKT
model useless. The result of fitting a BKT model is that we
get the best fit of the data given that we are modeling the 
data with a two-state HMM (if we disregard local optima). 
Presumably, such a model can give us some insights about 
student learning even if it is not modeling student mastery. 
So perhaps we can use such semantically degenerate models 
to understand student learning rather than to predict student 
mastery. 


5. CONCLUSION 


We have explored the issues of identifiability and semantic 
model degeneracy in Bayesian Knowledge Tracing. We have 
shown that what researchers posited was an identifiability 
problem is actually not an identifiability problem, and by 
using a result from the literature on learning hidden Markov 
models, we showed that an identifiability problem does not 
exist for BKT models (with the exception of some mathemat- 
ically degenerate cases that should not come up in practice). 
We then examined the various issues with fitting BKT mod- 
els that have been conflated with identifiability. We offered 
what we believe to be new insights on one potential source of 
semantic model degeneracy. We believe analyzing the sources 
of semantic model degeneracy in more detail can be a fruitful 
direction for future research. For example, it could be useful 
to know what BKT parameters result from fitting various 
other popular models of student learning. It would also be 
informative to see if we can find automated ways of detecting 
which assumptions of BKT are not met in our data (e.g., the 
number of levels of mastery, the independence of different 
skills). Such analyses could help in devising better student 
models, and ultimately may lead to a better understanding 
of student learning. 
ABSTRACT


Although millions of students have access to a variety of
learning resources on Massive Open Online Courses (MOOCs),
they usually have limited access to rapid feedback. Providing
guidance for students, which enhances the interaction
with students, is a promising way to improve the learning
experience. In this paper, we consider showing students the
emphasis of lectures before they begin learning. We propose a
novel framework that automatically generates and ranks the
topics within the upcoming chapter. We apply the Latent
Dirichlet Allocation (LDA) model to the subtitles of lectures
to generate topics. We then rank the importance of these
topics through a particular PageRank method, which also
leverages structural information of lectures. Experimental
results demonstrate the effectiveness of our approach, with
an 18.9% improvement in Mean Average Precision (MAP).
Finally, we simulate two cases to discuss how our framework can
guide students according to their learning status.


Keywords 
Massive Open Online Courses (MOOCs); Guidance for Students;
Topic Model; PageRank.


1. INTRODUCTION 


With recent developments in Massive Open Online Courses
(MOOCs), millions of students have access to abundant
high-quality learning resources at their convenience and at
no cost. Despite these advantages, students on MOOCs
usually have limited access to rapid feedback, and the lack of
interaction with instructors and peers can diminish their
learning experience [6, 16]. Previous explorations of course
design and intervention have shown that guidance can improve
students’ learning experience and performance [3, 11].
However, few works have investigated providing guidance at the
early stage of the learning process. As a strategy for
learning design, Conole suggested that teachers design a vision
for the course in terms of knowledge [6].


Traditionally, teachers emphasize important concepts in class.
But in MOOCs, not all teachers underline the key
points when giving lectures. Moreover, even if teachers
have repeated the key points in the videos, MOOC students
are prone to miss such information. A study of edX student
habits found that even certificate-earning students only
viewed the first 4.4 minutes of 12 to 15 minute videos [7].


With guidance that highlights the most important topics,
students can get a vision of the key points before watching
lectures, or briefly review this knowledge if they are about
to take assignments. Specifically, from the students’ perspective,
important topics are more likely to be involved in assignments
[2, 10], so such guidance will be valuable
for those who have little leisure time but want to complete
the course. Thus, such automatic guidance helps
students know the emphasis of upcoming lectures.


Previous studies in knowledge tracing represented key points
as knowledge components, which are inferred from student
performance on assignment items [9]. Besides, some works on
MOOCs simply defined a knowledge component as a single
problem or chapter [15, 17]. However, most MOOCs do not
have enough problem items for an accurate definition. Unlike
these works, our framework generates topics from video
subtitles, which is more general for MOOCs. Moreover, our
work is the first to rank these topics, by leveraging both
textual and structural information of videos.


Our work focuses on automatically providing students with
guidance at the early stage of the learning process. We propose
a novel framework that takes the video subtitles as input
and suggests to students the most important topics within the
upcoming chapter. To address this task, we decompose it
into the following three steps: (1) generate topics from subtitles
with the LDA model; (2) decide the importance of phrases
based on a particular PageRank method; (3) smooth the
PageRank values and measure the importance of topics. The
experiments show the effectiveness of our algorithm, which
improves Mean Average Precision (MAP) by 18.9%. We
also use two cases to illustrate how our framework helps
different students according to their learning status. The main
contributions of our work are as follows:


• We design a novel framework for MOOCs that automatically
provides students with a vision of important
topics at the early stage of their learning.


• We propose a particular PageRank method to rank the
importance of topics within the upcoming chapter.


• The experiments and simulated cases show the effectiveness
of our algorithm and how it works.
2. RELATED WORK


2.1 Design and Intervention 

Students participate in MOOCs through interactions
with lectures, assignments, and forums. Interventions have been
designed to enhance their engagement and learning experience.
Previous work explored the effect of video production
on student engagement [8], suggested detecting confusion
in forums [18], and showed that immediate feedback on
assignments can improve learning performance [11]. However,
most recent works designed interventions for students
during or after their learning process.


Basu et al. [3] presented an intervention that assists students
in understanding the detailed specification of assignments before
their attempts. However, this work addressed
assignments, not learning by watching lectures. Our
work focuses on providing students with a
vision of the key points they are going to learn.


2.2 Topic Model 


To automatically summarize the content of lectures, NLP
techniques are commonly used to extract the keyphrases in
the text. Topic models are designed to discover the latent
topics in a collection of documents. Among different
algorithms, Latent Dirichlet Allocation (LDA) is the most
common topic model currently in use [4].


For MOOCs, works concentrating on knowledge tracing
defined the knowledge component as a chapter or a problem
item [15, 17], but such a representation deviates from common
sense. Inspired by the work of Matsuda et al. [12], which
applied the LDA model to assignment items and viewed the
auto-generated clusters as knowledge component candidates,
we transfer this method to the videos in MOOCs. In our
work, we generate latent topics from video subtitles, and
define each topic as a probability distribution over phrases.


2.3 Ranking Model


Students are unlikely to post questions before their learning,
especially in MOOCs. Therefore, in order to provide guidance
at an early stage, we should rank the topics through
content analysis of the lectures. PageRank is a graph-based
ranking algorithm and a common way to measure the
relative importance of items [14].


Some variants, like TextRank, create an undirected phrase
graph from natural language text for text processing tasks such
as keyphrase extraction and extractive summarization [5, 13].
Unlike these works, we view the MOOC video subtitles
as the documents and leverage the structural relations between
lectures. More specifically, we design a novel method
to construct the phrase graph, which assigns different weights to
phrase relations in different documents.
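
To make the general idea concrete, the following is a minimal sketch of a generic TextRank-style ranking over a weighted, undirected phrase co-occurrence graph. It is not the particular PageRank method proposed in this paper: the function names, the co-occurrence window, the damping factor, and the per-document weights are all assumptions made for illustration.

```python
from collections import defaultdict


def build_phrase_graph(documents, doc_weights=None, window=2):
    """Link phrases that occur within `window` positions of each other.

    `doc_weights` (hypothetical) scales the edges contributed by each document.
    """
    if doc_weights is None:
        doc_weights = [1.0] * len(documents)
    graph = defaultdict(lambda: defaultdict(float))
    for phrases, weight in zip(documents, doc_weights):
        for i, a in enumerate(phrases):
            for b in phrases[i + 1:i + window]:
                if a != b:
                    graph[a][b] += weight
                    graph[b][a] += weight
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration on an undirected graph."""
    nodes = list(graph)
    n = len(nodes)
    scores = {v: 1.0 / n for v in nodes}
    total_weight = {v: sum(graph[v].values()) for v in nodes}
    for _ in range(iterations):
        updated = {}
        for v in nodes:
            # Each neighbor u passes on its score in proportion to edge weight.
            incoming = sum(
                scores[u] * w / total_weight[u] for u, w in graph[v].items()
            )
            updated[v] = (1 - damping) / n + damping * incoming
        scores = updated
    return scores


if __name__ == "__main__":
    docs = [["stack", "queue", "stack", "tree"], ["stack", "heap"]]
    ranks = pagerank(build_phrase_graph(docs))
    for phrase, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
        print(phrase, round(score, 3))
```

In this sketch every node has at least one edge, so the scores remain a probability distribution over phrases across iterations.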


3. DATA PREPARATION 


MOOC providers now allow registered users to download
the lecture videos and subtitle files. Therefore, it is
convenient for researchers to analyze the video content as
documents, using natural language processing (NLP) techniques.
The dataset for this paper consists of a Coursera
course, “Data Structure and Algorithm”. The filmed lectures
are hierarchically organized.
To analyze the content of the lectures, we first extract noun
phrases from each subtitle file as preprocessing, using the Python
library TextBlob. Previous studies demonstrated that nouns
and noun phrases tend to produce keywords that typically
express what the content is about [1]. Thus, the lectures
can be represented as lists of consecutive phrases. There are
3,964 different phrases in total, and each lecture has an average
length of 129.4 phrases (including repeated phrases). Besides,
the course sets up a quiz for every chapter, plus two
exams. The questions in these assignments are randomly
sampled from a problem set, which contains 254 different
items.


4. METHODS 


The main objective of our research is to automatically provide
students with guidance before their learning, which tells
them the most important topics of the upcoming chapter.
Based on such guidance, students can get an overview of the
course, or check whether they have mastered these topics
before they take an assignment. In brief, we propose a
novel framework for MOOCs that takes a set of subtitles as
inputs and returns a list of topics ranked by their
importance. Figure 1 shows the overall architecture of our
framework, which can be decomposed into three steps.


In the first step, we use LDA model to generate topics from 
the subtitles of lectures. In the second step, we define a 
particular PageRank method for ranking the importance of 
phrases. Finally, we apply three transfer functions to reas- 
sign the importance value of phrases and measure the im- 
portance of topics. 


4.1 Generating Topics from Subtitles 

We first generate topics for each chapter separately.
Inspired by previous work, which applied an LDA model to
assessment items [12], we transfer this method to the subtitles
of videos in MOOCs. LDA is a generative probabilistic
model that allows a set of observations to be explained
by unobserved groups [4]. It is commonly used to discover the
latent topics of a set of documents. In our case, we treat lectures
as documents and phrases as words. Specifically, the model
takes the phrase lists from a chapter as inputs, and returns
a set of latent topics, where each topic is characterized by a
distribution over phrases.


In practice, we implement the model with the Python library
“lda”. The number of iterations is set to 200, and the
number of topics varies with the number of lectures in
the chapter, considering that different chapters cover
different numbers of topics. In addition, if the topics have been
predefined by experts (given n keywords for each topic), we
can use such information as an alternative to
generating topics with the LDA model. Specifically, to construct
a probability distribution over phrases for such a topic, we only need
to set the probability of each of its n keywords to 1/n and
the probability of every other phrase to 0.
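
The expert-defined alternative can be sketched in a few lines of Python. This is only an illustration of the 1/n rule described above; the function name `expert_topic_distribution` and the small vocabulary in the usage note are hypothetical and do not come from the paper.

```python
def expert_topic_distribution(keywords, vocabulary):
    """Build a topic as a probability distribution over the vocabulary.

    Each of the n distinct expert keywords gets probability 1/n and every
    other phrase gets 0. (Illustrative helper, not the authors' code.)
    """
    unique_keywords = set(keywords)
    if not unique_keywords:
        raise ValueError("at least one keyword is required")
    unknown = unique_keywords - set(vocabulary)
    if unknown:
        raise ValueError(f"keywords missing from the vocabulary: {sorted(unknown)}")
    n = len(unique_keywords)
    return {
        phrase: (1.0 / n if phrase in unique_keywords else 0.0)
        for phrase in vocabulary
    }
```

For example, with the assumed vocabulary `["MST", "Prim algorithm", "DFS", "queue"]` and the keywords `["MST", "Prim algorithm"]`, the two keywords each receive probability 0.5 and the remaining phrases receive 0.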


The output of this step for each chapter is a set of latent 
topics, in the form of probability distribution over phrases. 
To have an intuitive sense, we display each topic as a tu- 
ple, including three phrases with the highest probability in 
the distribution. Table 1 shows the topics generated from 
“Graph”, which is one of the chapters in this course. 


Figure 1: Overview of the framework that takes subtitles of MOOCs as inputs, and generates a ranked list
of topics to students.


Chapter 8: Graph

(Kruskal algorithm, algorithms, data structure)
(adjacency list, adjacent matrix, list contains)
(MST, Prim algorithm, minimum weight edge)
(DAG, start node, data structure)
(DFS, topological sort, post process)
(Dist, shortest path, source node)
(old value, time complexity, Dijkstra)

Table 1: The topics generated by the LDA model in
Chapter “Graph”.


4.2 Ranking the Importance of Phrases 

Our basic intuition is that important phrases are more likely
to be mentioned in class. Moreover, when teachers talk
about a new topic, they often briefly revisit related
topics for comparison, which enables us to connect
phrases in different chapters. Based on these
latent relations, we design a particular PageRank method,
which leverages both textual and structural information of
lectures, to rank the importance of phrases within chapters.


Our ranking algorithm can be decomposed into three processes.
The first is to construct a phrase graph for each
chapter. Then, for each chapter, we combine its graph with the
graphs of all previously released chapters.
Finally, we define a random walk on the combined graph to
compute the importance of phrases. The output
of this step is a ranked list of phrases, along with their
importance values.


4.2.1 Construction 

Intuitively, two important phrases occurring at close
positions are likely to be related to each other.
PageRank is an algorithm for measuring the importance
of web pages based on the web graph [14]. In
our case, we treat the phrases as nodes and connect two
phrases if they are close to each other in the lecture.


Formally, we define an undirected graph $G_k = (V_k, E_k)$ for
the $k$-th chapter, where $V_k = \{v_1, v_2, \ldots, v_{n_k}\}$ denotes the set
of phrases and $L_k = \{l_1, l_2, \ldots, l_{m_k}\}$ denotes the lectures in the
$k$-th chapter. We follow TextRank [13] to construct the
basic phrase graph for each chapter, which defines an edge
between two phrases if the distance between their offset positions
is less than a preset parameter $c$ (we set it to 8 in the
experiments). We define the weight of an edge as the number
of co-occurrences of the two phrases. Self-loops are allowed
in our algorithm.


Sample subtitle slice: “We will introduce Prim Algorithm first.
Prim Algorithm is similar to Dijkstra Algorithm.”
[Graph: nodes “Prim Algorithm” and “Dijkstra Algorithm”, with an edge labeled W=2.]

Figure 2: A sample graph built for a slice of subtitles,
which is printed above the graph.


The formula for the edge weight between
phrases $v_i$ and $v_j$ is


$$W_k(v_i, v_j) = \sum_{s=1}^{m_k} \sum_{v_i \in l_s,\, v_j \in l_s} I\{\mathrm{dist}(v_i, v_j) < c\},$$


where $I$ is an indicator function and $\mathrm{dist}(v_i, v_j)$ denotes the
offset difference between $v_i$ and $v_j$. The formula implies that
two phrases that appear together more frequently in the lectures
receive a higher edge weight. For
instance, Figure 2 shows a sample graph built for a slice of
subtitles.
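
The construction step can be sketched as follows. This is a minimal illustration, assuming that each lecture is already a list of phrases in order of appearance and that the offset of a phrase is its index in that list (the paper does not state whether offsets are counted in phrases or in words). The function name `build_phrase_graph` is hypothetical.

```python
from collections import Counter


def build_phrase_graph(lectures, c=8):
    """Count co-occurrences of phrase pairs whose offsets differ by less than c.

    lectures: list of lectures, each a list of phrases in order of appearance.
    Returns a Counter mapping an unordered phrase pair (stored as a sorted
    tuple) to its edge weight. A pair of identical phrases yields a self-loop.
    """
    weights = Counter()
    for phrases in lectures:
        for i, first in enumerate(phrases):
            # Only look ahead, so every pair of occurrences is counted once.
            for j in range(i + 1, min(i + c, len(phrases))):
                second = phrases[j]
                weights[tuple(sorted((first, second)))] += 1
    return weights
```

Applied to the phrase sequence of the subtitle slice in Figure 2, `["Prim Algorithm", "Prim Algorithm", "Dijkstra Algorithm"]`, this sketch gives weight 2 to the edge between the two algorithms and weight 1 to a self-loop on “Prim Algorithm”.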


4.2.2 Combination


Because teachers usually avoid repeating topics that have been
discussed before, the relations between phrases will be insufficient if
we only consider the current chapter. For example, consider
a paragraph of Chapter “Binary Tree”: “We use a queue to
implement BFS, ..., binary linked list is a way to store
binary tree.” The phrases “BFS” and “binary tree” will not be
connected unless we bring in Chapter “Stack and Queue”
to connect “queue” and “linked list”. Thus, when phrases
propagate information over the graph, some important
phrases are not associated with each other directly, but are
linked by a path through some “hubs”. Based on these
considerations, in order to supply more relations to the current
phrase graph, we combine it with those of previous chapters.

Therefore, we propose a weighted method for the combination
of graphs. Specifically, when we rank the phrases in
a chapter, we combine the current phrase graph with those
constructed from all other chapters that have been released.
We sum the weights between two phrases across the different graphs
using a damping factor $\alpha$, which gives a lower weight to
an earlier chapter. Formally, edge weights in the $k$-th chapter
are formulated as


$W_k(v_i, v_j) =$


4.2.3 Computation 

In each iteration, the PageRank value of a given node is
distributed among its adjacent nodes in proportion to the
weights of the connecting edges. We
set the number of iterations to 20, which is enough
to ensure convergence in our experiments, and we set
the damping factor $d$, which represents the
transition probability, to 0.85. For each chapter, the output of this
model is a list of phrases ranked by their PageRank values.


Formally, the iterative process can be described by the following
equations. We first initialize all phrases with the
same value, $PR_k(v_i; 0) = \frac{1}{N}$, where $N$ is the total number
of nodes. At each time step, the computation yields


$$PR_k(v_i; t+1) = \frac{1-d}{N} + d \sum_{v_j \in M(v_i)} \frac{PR_k(v_j; t)\, W_k(v_i, v_j)}{\sum_{v_s \in M(v_j)} W_k(v_j, v_s)},$$


where $PR_k(v_i; t)$ denotes the PageRank value of $v_i$ at time
$t$ in the $k$-th chapter, and $M(v_i)$ denotes the set of nodes
adjacent to $v_i$. The computation process ensures that the
sum of all PageRank values equals 1 at
any time step.
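
The iteration can be sketched as a standard weighted PageRank on an undirected graph. This is an illustrative implementation under the settings stated above (d = 0.85, 20 iterations, uniform initialization); the function name `weighted_pagerank` and the input format (a mapping from unordered phrase pairs to weights) are assumptions, not the authors' code.

```python
from collections import defaultdict


def weighted_pagerank(weights, d=0.85, iterations=20):
    """Run weighted PageRank on an undirected graph.

    weights: mapping from a phrase pair (u, v) to a positive edge weight.
    Returns a dict mapping each phrase to its PageRank value.
    """
    adjacency = defaultdict(dict)
    for (u, v), w in weights.items():
        adjacency[u][v] = adjacency[u].get(v, 0) + w
        if u != v:  # a self-loop is stored once
            adjacency[v][u] = adjacency[v].get(u, 0) + w

    nodes = sorted(adjacency)
    n = len(nodes)
    # Total edge weight leaving each node, used to split its value.
    strength = {u: sum(adjacency[u].values()) for u in nodes}
    rank = {u: 1.0 / n for u in nodes}

    for _ in range(iterations):
        new_rank = {}
        for u in nodes:
            received = sum(
                rank[v] * w / strength[v] for v, w in adjacency[u].items()
            )
            new_rank[u] = (1 - d) / n + d * received
        rank = new_rank
    return rank
```

Because every node in the graph has at least one incident edge, each node passes on its full value in every step, so the values keep summing to 1.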


4.3 Measuring the Importance of Topics 
However, the PageRank method only captures relative
importance and exaggerates the differences between top phrases.
To avoid the situation where one phrase plays a dominant
role in the importance of topics, we smooth the result with three
commonly used transfer functions: a linear
function, a sigmoid function and a Gaussian function. The
gradients of these functions are gentler, which alleviates
the “slump” over the first several phrases in the original
ranking. Figure 3 compares the distribution of phrase importance
under the original PageRank values and under the three new
functions.


We thus obtain a ranking of phrase importance with a
gentler slope. We then multiply the phrase distribution of
each topic by the vector of phrase importance. The product
can be viewed as the importance of the topic in
this chapter:


$$\mathrm{Imp}(\mathit{Topic}) = \sum_{\mathit{phrase} \in \mathit{Topic}} \mathrm{Imp}(\mathit{phrase})\, F(p(\mathit{phrase})),$$


where $p(\mathit{phrase})$ denotes the probability of the phrase occurring
in $\mathit{Topic}$ and $F$ denotes one of the transfer functions.
Finally, we sort the topics by their importance, and output a
ranked list of topics as the final result for this chapter.

Figure 3: The comparison of the distributions of
phrase importance between the original PageRank values
and the three transfer functions that aim to smooth the
result of the original ranking.
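
The topic score above can be sketched as a weighted sum. This is a minimal illustration that follows the formula as printed; the exact form of the linear, sigmoid and Gaussian transfer functions is not given in the text, so the transfer function is left as a caller-supplied argument that defaults to the identity. The names `topic_importance` and `rank_topics` are hypothetical.

```python
def topic_importance(topic_distribution, phrase_importance, transfer=lambda x: x):
    """Sum Imp(phrase) * F(p(phrase)) over the phrases of one topic.

    topic_distribution: {phrase: probability of the phrase in the topic}
    phrase_importance:  {phrase: importance value, e.g. from PageRank}
    transfer:           the function F (identity by default, an assumption)
    """
    return sum(
        phrase_importance.get(phrase, 0.0) * transfer(probability)
        for phrase, probability in topic_distribution.items()
    )


def rank_topics(topics, phrase_importance, transfer=lambda x: x):
    """Return topic names sorted from most to least important."""
    scores = {
        name: topic_importance(distribution, phrase_importance, transfer)
        for name, distribution in topics.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Passing a different `transfer` callable is the only change needed to compare smoothing choices on the same topics.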


5. EXPERIMENTS 


In this section, we evaluate our framework by identifying 
the most important topics for each chapter. We examine 
the performance of our algorithm by comparing with four 
baselines. The ground truth labels come from the problem 
set annotated by three domain experts. Three metrics are 
used to evaluate the effect of our ranking algorithm. 


5.1 Setups 


Our framework first generates several topics from the sub- 
titles in each chapter. Then, we compute the importance of 
these topics by our algorithm and get a ranking list. These 
topics are also sorted by ground truth labels, which leads to 
an ideal ranking. Based on these two rankings, we then com- 
pute the metric score of our ranking in this chapter. At last, 
we take the average among chapters as the performance of 
our algorithm. Besides, we also try different variants of our 
algorithm by taking different transfer functions and altering 
the damping factor. 


5.2 Baseline Algorithms

To evaluate the performance of our algorithm, we take four
commonly used strategies as baselines to rank the importance
of phrases: (1) Random; (2) Bag-of-Words; (3) TF-IDF;
(4) TextRank. For comparability, these baselines
also adopt the topics generated by the LDA model as ranking
items.


The Random strategy simply ranks the topics in random order.
The Bag-of-Words strategy takes the frequency of each
phrase in a chapter as its importance. One shortcoming
of Bag-of-Words is that phrases with a high
raw count in every chapter are not necessarily more important than
other phrases. The TF-IDF strategy
addresses this problem by weighting the phrase frequencies
by the inverse document frequency. The TextRank strategy
in our experiments follows [13], and leverages
neither previous chapters nor transfer functions.


5.3 Ground Truth and Metrics


Because students who want to complete the course are more likely
to finish the quizzes and exams [2, 10], we think they place
a higher value on the topics that count for more in the
assignments. Thus, in this paper, we define the importance
of a topic as “the number of problems that involve this topic”.


Type       Algorithm        nDCG    MAP     τ_B
Baseline   Random           0.838   0.586    0.000
Baseline   BoW              0.867   0.631    0.007
Baseline   TF-IDF           0.850   0.580   -0.039
Baseline   TextRank         0.869   0.640   -0.010
Ours       PR-Linear        0.871   0.645    0.211
Ours       PR-Sigmoid       0.883   0.649    0.256
Ours       PR-Gaussian      0.878   0.613    0.144
Ours       α-PR             0.900   0.749    0.263
Ours       α-PR-Linear      0.920   0.752    0.237
Ours       α-PR-Sigmoid     0.917   0.761    0.266
Ours       α-PR-Gaussian    0.906   0.747    0.255

Table 2: The comparison of performance between
four baselines and our algorithm. For all metrics, a
higher value means a better performance.


Three domain experts in computer science independently
annotated the relevance between the problems and the topics.
Specifically, given the problem set and the topics we
generated, the raters labeled each topic with all the problems
whose content is related to this topic. The Cohen's Kappa
for the annotations was 0.535 (in the range of [−1, 1]), which
indicates moderate inter-rater agreement. Considering
that raters may understand the generated topics differently,
we took the union of the problems selected by the three
raters as the final result. We then define the number of
problems in this set as the ground truth. This process induces
a human-generated ranking, which is then compared to the
ranking computed by our algorithm. We use three
metrics to evaluate the effectiveness of our ranking algorithm:
nDCG, MAP and Kendall's τ, which are widely used for
ranking models.
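
As an illustration of the first metric, the sketch below computes nDCG for a predicted topic ranking against ground-truth relevance values (here, the number of problems per topic). It uses the common gain / log2(rank + 1) form; the paper does not state which nDCG variant was used, so this is an assumption, and the names `dcg` and `ndcg` are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance values listed in ranked order."""
    return sum(
        relevance / math.log2(rank + 1)
        for rank, relevance in enumerate(relevances, start=1)
    )


def ndcg(predicted_order, relevance):
    """nDCG of a predicted ranking.

    predicted_order: topics in the order produced by the ranking algorithm.
    relevance:       {topic: ground-truth relevance, e.g. number of problems}.
    """
    gains = [relevance[topic] for topic in predicted_order]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that matches the ground-truth order scores 1, and any ranking that places a less relevant topic above a more relevant one scores below 1.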


5.4 Results 


5.4.1 Performance Comparison 

Table 2 shows the comparison of performance between the
baselines and our algorithms. We report seven variants of our
algorithm, which differ in whether previous chapters are combined
as additional information and in which transfer function
is used for smoothing. We find that all the variants outperform
the baselines on nDCG and Kendall's τ, and all but
PR-Gaussian do so on MAP. The best variant (α-PR-Sigmoid)
yields an 18.9 percent boost in MAP score compared with
TextRank. The results are also largely consistent across the
different metrics. Besides, the methods that combine the
content of previous chapters show a clear improvement
over those that do not. In addition, we find
the transfer functions effective regardless of whether the
method combines the previous chapters.

We then discuss the possible reasons why our algorithms
beat the baselines, especially Bag-of-Words and TF-IDF.
First, PageRank methods leverage the relations
between phrases. The PageRank method suggests that a
phrase is important if the neighbors linked to it are important,
so an important phrase can be discovered even if it
does not occur very often. Second, combining previous chapters
provides the phrase graph with richer structural information.
One plausible explanation is that some phrases and relations
that do not appear in the current chapter act as “hubs”
that connect two important phrases. Finally, transfer
functions alleviate the bias from PageRank. Because the
importance of top phrases is exaggerated in PageRank,
topics that assign these phrases a higher probability
would otherwise surpass the others.


Figure 4: The change of nDCG in different PageRank
variants, with α tuned from 0 to 1.


5.4.2 Parameter Analysis

When we combine the graphs of previous chapters, the damping
factor α must be preset. The analysis of α is shown in
Figure 4. The trends are almost the same under the
different metrics. Note that when α equals 0, the method
degrades into the one that does not combine the previous chapters.


We observe an interesting phenomenon: as α is tuned from
0.05 to 1.00, the performance trends downward when transfer
functions are used, whereas when the PageRank value is used
directly the performance remains unchanged most of the time
but increases at 1.00. Therefore, for the
experiments in Table 2, we set α to 0.05 if we use a transfer
function for smoothing and to 1.00 otherwise.
When using a transfer function, a lower value of α enables
the current graph to enrich its structural information
without influencing the relations between phrases. However, when
using the original value, the importance of top phrases is
exaggerated, so α is set to 1.00 to “dilute” the effect
of top phrases.


6. DISCUSSION 


The experiments have shown the performance of ranking
topic importance within chapters, which is useful for
students who want to know the emphasis of upcoming lectures.
Moreover, when students prepare for exams, our framework can
also guide them according to their learning status.
Suppose that two students ($S_A$ and $S_B$) are preparing for
the mid-term exam, which covers 8 chapters. $S_A$ has learned
all the content well, while $S_B$ is deficient in “Linear List”,
“Queue and Stack”, “Binary Tree Application” and “Tree and
Forest”. We take all subtitles as inputs for $S_A$, so that we
can design an overall review plan, whereas we take only the
subtitles of those four chapters as inputs for $S_B$, in order to
concentrate on the topics among the weak points. The results
are shown in Table 3.


Topics for $S_A$ Topics for $S_B$
logical structure sequential list
complete binary Te
binary search tree
binary tree structure | binary tree structure
binary tree traversal

Table 3: The top five topics for $S_A$ and $S_B$. Each
topic is summarized by one phrase.


Case A shows that our algorithm suggests topics about
“binary tree” as the most important content. In fact, the tree
structure is indeed the most important in the first half of the
course, because three chapters introduce the foundation,
application, and extension of binary trees, respectively. In Case B, our
algorithm puts more emphasis on “linear list”. One plausible
explanation is that the linear list is a fundamental data structure
and the instructor frequently mentions it when introducing
the implementations of queues, stacks and tree structures.


7. CONCLUSION AND LIMITATION 


In this paper, we proposed a novel framework to provide
guidance for MOOC students before their learning. Our
method first generated topics from video subtitles with an LDA
model. Then, we ranked the importance of phrases based
on a particular PageRank method. Finally, we smoothed
the PageRank values and measured the importance of topics.
As a result, we displayed the most important topics of the
upcoming chapter. Experiments showed the effectiveness of
our algorithm according to three metrics.


Several factors limited the findings of our study. One was
the diversity of our dataset, which included only one
science course. However, it is time-consuming to label the
topics with the problems, and the annotations have to be
done by domain experts. Another limitation was the lack of
truly personalized guidance. We plan to extend
our study by understanding student learning behaviors and
including such information in the phrase graph. Nonetheless,
the main objective of our study is to introduce a
novel framework that can provide guidance for students at
the early stage of their learning process.


ABSTRACT


We study the problem of partitioning a class of N students 
into k groups of n students each (N = k × n), such that
their learning from peer interactions is maximized. In our 
formalization of the problem, any student is able to increase 
his score in the subject the class is studying up to the score 
of the student who is at p-percentile among his higher ability 
peers. In contrast, the past work presumed that only stu- 
dents with score below the group mean may increase their 
score. We give a partitioning algorithm that maximizes to- 
tal gain summed over all the students for any value of p such 
that 100/(100 − p) is integer valued. The time complexity of
the proposed algorithm is only O(N log N). We also present 
experimental results using real-life data that show the supe- 
riority of the proposed algorithm over current strategies. 


1. INTRODUCTION 


A basic problem that has challenged educators for a long 
time is how to group students in a class in order to supple- 
ment their learning from the teacher with the learning from 
peers [6, 11]. Two popular strategies currently in vogue 
are: i) heterogeneous (also called diversity-based) grouping, 
and ii) homogeneous (also referred to as stratified or ability- 
based) grouping [5]. Both have their ardent proponents. 
The results from the empirical studies on the relative effec- 
tiveness of the two are inconclusive and the public opinion 
has also been mixed [3, 9]. 


In a major departure from the conventional thinking, a com- 
putational perspective was taken to address this problem 
in [1]. However, the learning model underlying the proposed 
algorithmic approach postulated that only the below average 
students are able to increase their ability score [4]. This pa- 
per removes this limitation, recognizing that every student 
can benefit from peer interactions [6, 8]. 


1.1 Contributions 

• We admit a general learning model that specifies that any
student is able to increase his ability score up to the level
of the student who is at p-percentile amongst his higher
ability peers. The value of p is an input parameter,
selected by the educator. The model in [8] can be viewed as
a special case, with p set to 100.


• For the above learning model, we provide an algorithm
for partitioning N students into k groups of n students
each (N = k × n) with the goal of maximizing learning
gain summed over all the students. We show that the
algorithm is optimal for the values taken by p such that
100/(100 − p) is integer-valued. Thus, it is optimal for
p ∈ {99, 98, 95, 90, 80, 75, 66⅔, 50}. The time complexity of
the algorithm is O(N log N).


• We present experimental results using real datasets, showing
the superiority of our approach over current strategies.


1.2 Limitations 


• Although our learning model has been abstracted from the
findings in the education literature, a rigorous empirical
validation of the model is future work. The insights gained
are nonetheless instructive.


• Teaching others and giving help has been shown to be
positively correlated to increase in learning [2]. Incorporating
such learning gains for high ability students is future work.


2. RELATED WORK 


The question of how to group students to maximize their 
gain from peer interactions was first addressed from a com- 
putational perspective in [1]. The authors proposed two 
functions to model learning gains. The first maximizes the 
number of students who improve their ability score [4], while 
the second incorporates the extent of these improvements. 
In both the cases, however, only the below average stu- 
dents benefit and the higher ability students have zero gain. 
The authors showed that the partitioning problem with the 
goal of maximizing the number of benefiting students is NP- 
complete, while they left open the question of the complexity 
class of the problem with the second gain function. 


The viewpoint that every student can learn from the higher 
ability peers is also present in [8]. In their model, every 
student may increase his ability to a fixed level, which is 
the ability of the highest ability student, i.e. p = 100. This 
assumption is too rigid and optimistic. In contrast, we admit 
various levels of gain for different students. 


Our problem bears resemblance to the expert-team formation
problem, in which the experts are multi-dimensional
vectors of skills and the goal is to find a team that can collec- 
tively perform a given task requiring certain skills [10]. How- 
ever, our students are described by 1-dimensional scores, and 
our objective is not to locate a single team, but to partition 
the students such that their learning gain is maximized. 


Our problem also superficially resembles the classical clus- 
tering problem [7]. However, unlike the classical clustering, 
which aims to maximize the similarity of all the points in a 
cluster to a cluster center, our problem has no one point in 
a partition with respect to which the distance of all other 
points needs to be optimized (see Fig. 1). 


Total Learning Gain = (0 + 1 + 2 + 3 + 3 + 4 + 5 + 6 + 6 + 7) = 37

Figure 1: Computation of the potential learning gain for a group of ten students with 75-percentile chosen as the reference point. The
$i$-th box contains the score of the $i$-th student. The learning gain for each student is the difference between his score and the score of
student at p-percentile amongst his peers having higher score than him. For the first student, the index of the student at 75-percentile
amongst his higher ability peers is (1 + ⌈(10 − 1) * 75/100⌉) = 8. Since the score of the latter is eight, the gain for the first student is
(8 − 1) = 7. For the second student, the index of the student at 75-percentile amongst his higher ability peers is also 8
(2 + ⌈(10 − 2) * 75/100⌉), thus giving him a gain of (8 − 2) = 6, and so on. The gain for the last student is zero, as there is no one
above to learn from.
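
The computation walked through in the caption of Figure 1 can be sketched as follows. This is a minimal illustration, assuming the scores of the ten students are 1 through 10 (consistent with the gains quoted in the caption); the function name `percentile_gain` is hypothetical. Exact rational arithmetic is used so that values such as p = 66⅔ are handled without rounding error.

```python
import math
from fractions import Fraction


def percentile_gain(scores, p):
    """Total learning gain of one group under the p-percentile model.

    The student at (1-based, ascending) index i improves up to the score of
    the student at index i + ceil((n - i) * p / 100).
    """
    ordered = sorted(scores)
    n = len(ordered)
    total = 0
    for i in range(1, n + 1):
        higher_peers = n - i
        reference = i + math.ceil(Fraction(higher_peers) * Fraction(p) / 100)
        total += ordered[reference - 1] - ordered[i - 1]
    return total
```

With scores 1 through 10 and p = 75, the sketch returns 37, the total shown in Figure 1.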


3. PROBLEM STATEMENT 


We have a class of N students. Each student $i$ is associated
with a score $\theta_i \in \mathbb{R}_{>0}$, representing the student's ability in the
subject the class is studying [4]. For simplicity, scores are
assumed to be distinct, so there is a one-to-one correspondence
between the student $i$ and the score $\theta_i$. Students are
ordered in the increasing order of scores.


Students are able to increase their score through interactions
with peers in the group in accordance with a gain
function [12, 13]. The gain from peer learning for a group $G$ is
given by a function $\mathcal{L}$. Our objective is to find a set $\mathcal{G}$ of k groups of
n students each (N = k × n), such that the overall gain for
students is maximized. That is, our objective is


$$\max \sum_{G \in \mathcal{G}} \mathcal{L}(G). \quad (1)$$


The learning function is of the form

$$\mathcal{L}(G) = \sum_{i=1}^{|G|} (R_i^G - \theta_i^G), \quad (2)$$


where $R_i^G$ is the reference score for $G$'s $i$-th ranked student.
The intuition is that each student can increase his
score up to the reference score.


3.1 Learning up to p-Percentile

PROBLEM 1 (P-PERCENTILE PARTITIONING PROBLEM).
The gain function in Eq. 2 is given by

$$\mathcal{L}^p(G) = \sum_{i=1}^{|G|} (p_i^G - \theta_i^G), \quad (3)$$

where $p_i^G$ is the score of the student whose score is at the
p-percentile position of the scores of the students having higher
score than the $i$-th student in $G$.


For a given set of scores, the p-percentile score is the score
below which p% of scores fall. To find the p-percentile
score, the corresponding index is calculated first, which is
$\lceil np/100 \rceil$. The value at this index then is the p-percentile
score. Thus,


$$\text{p-percentile}(\theta_1, \theta_2, \ldots, \theta_n) = \theta_{\lceil n \cdot p/100 \rceil}. \quad (4)$$


Fig. 1 graphically illustrates the percentile gain function.


4. SOLUTION 

THEOREM 1. For values of p such that p/(100 − p) is
integer-valued, the p-Percentile Partitioning problem can be
solved optimally in O(N log N) time.


We shall prove the theorem constructively by providing an
optimal algorithm whose time complexity is O(N log N). It
is named Percentile Partitions and its pseudo-code is shown
in Algorithm 1. The algorithm exploits the special structure
of our problem that we elicit next.


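We first expand the equation for learning gain w.r.t.
p-percentile as given in Eq. 3 into


$$\mathcal{L}^p(G) = (\text{p-percentile}(\theta_2^G, \theta_3^G, \ldots, \theta_n^G) - \theta_1^G) + (\text{p-percentile}(\theta_3^G, \theta_4^G, \ldots, \theta_n^G) - \theta_2^G) + \cdots + (\text{p-percentile}(\theta_n^G) - \theta_{n-1}^G).$$


Using the definition of p-percentile from Eq. 4, the above
can be written as


$$\mathcal{L}^p(G) = (\theta^G_{1+\lceil (n-1)p/100 \rceil} - \theta_1^G) + (\theta^G_{2+\lceil (n-2)p/100 \rceil} - \theta_2^G) + \cdots + (\theta_n^G - \theta_{n-1}^G).$$


To this we add the term $(\theta_n^G - \theta_n^G)$ corresponding to zero
gain of the $n$-th student. Thus, we have


$$\mathcal{L}^p(G) = (\theta^G_{1+\lceil (n-1)p/100 \rceil} - \theta_1^G) + (\theta^G_{2+\lceil (n-2)p/100 \rceil} - \theta_2^G) + \cdots + (\theta^G_{(n-1)+\lceil p/100 \rceil} - \theta_{n-1}^G) + (\theta_n^G - \theta_n^G).$$


Collecting the positive and negative terms together, we get


$$\mathcal{L}^p(G) = (\theta^G_{1+\lceil (n-1)p/100 \rceil} + \theta^G_{2+\lceil (n-2)p/100 \rceil} + \cdots + \theta^G_{(n-1)+\lceil p/100 \rceil} + \theta_n^G) - (\theta_1^G + \theta_2^G + \cdots + \theta_{n-1}^G + \theta_n^G),$$

which can be written succinctly as


$$\mathcal{L}^p(G) = \sum_{i=1}^{n} \theta^G_{i+\lceil (n-i)p/100 \rceil} - \sum_{i=1}^{n} \theta_i^G. \quad (5)$$


Using this equation, our objective becomes


$$\max \sum_{G \in \mathcal{G}} \left( \sum_{i=1}^{n} \theta^G_{i+\lceil (n-i)p/100 \rceil} - \sum_{i=1}^{n} \theta_i^G \right).$$


The second component in the above sum is constant for any
given set of ability scores. Therefore, our objective can be
simplified to


$$\max \sum_{G \in \mathcal{G}} \sum_{i=1}^{n} \theta^G_{i+\lceil (n-i)p/100 \rceil}. \quad (6)$$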


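LEMMA 1. Given $p \in [0, 100]$ and an ascending sequence
of $\theta_i \in \mathbb{R}_{>0}$, for $(100 - p) \mid 100$, $\sum_{i=1}^{n} \theta_{i+\lceil (n-i)p/100 \rceil}$ is
equivalent to $\sum_{i=1}^{n} \gamma_i \cdot \theta_i$, where


$$\gamma_i = \begin{cases}
\frac{100}{100-p}, & \text{if } \lceil \frac{np}{100} \rceil < i \le n \\
\mathrm{mod}(n, \frac{100}{100-p}), & \text{if } \mathrm{mod}(n, \frac{100}{100-p}) \ne 0 \text{ and } i = \lceil \frac{np}{100} \rceil \\
0, & \text{otherwise.}
\end{cases}$$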

PROOF. It is to be noted that a student at index $i$
improves up to the score of the student at index $i + \lceil (n-i)p/100 \rceil$.
As the student indexes are traversed from the higher-score
end to the lower end, with unit decrease in the value of $i$, the
quantity $\lceil (n-i)p/100 \rceil$ increments by unity, except for the
values of $i$ for which $(n-i)p$ is a multiple of hundred. In the
latter case, although there is a decrement in the value of $i$
by one, the value of $\lceil (n-i)p/100 \rceil$ stays the same as that of
$\lceil (n-i-1)p/100 \rceil$, causing the index up to which students
are improving to decrement by one. It is easy to derive that
this process repeats itself after a period of 100/(100 − p).
Further, when n is not a multiple of the above period, there
will be mod(n, 100/(100 − p)) students who will be improving
up to the smallest index value. For the remaining students,
as no other student improves up to their score, a $\gamma$ value of
zero is straightforward. □

EXAMPLE 1. In Fig. 1, we have n = 10 and p = 75.
Thus, in accordance with Lemma 1, we have


$$\gamma_i = \begin{cases}
4, & \text{if } 8 < i \le 10 \\
2, & \text{if } i = 8 \\
0, & \text{otherwise.}
\end{cases}$$


The above may also be verified visually from Fig. 1. It is easy
to note that the students at the 7th, 8th, and 9th index improve
up to the score of the 10th student, while the 10th student
with zero gain remains at the same score. This makes the
score of the 10th student visible four times in the updated
scores, leading to the $\gamma$ value of four. Similarly, the score
of the student at the 9th index is also visible four times because
of students at the 3rd, 4th, 5th, and 6th indexes improving up to
his score. On the other hand, only students at the 1st and 2nd
indexes improve up to the score of the 8th student. Hence, a $\gamma$
value of two for the 8th student. No one is improving his
score up to the score of any of the students at index below
eight. So, the $\gamma$ values corresponding to them are zero.
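
The coefficients in Example 1 can also be obtained by direct counting, without using the closed form of Lemma 1. The sketch below is an illustration only; the function name `gamma_coefficients` is hypothetical. For each index it counts how many students (including the top student, who keeps his own score) end up at that student's score.

```python
import math
from collections import Counter
from fractions import Fraction


def gamma_coefficients(n, p):
    """Return [gamma_1, ..., gamma_n] for a group of n ascending scores.

    gamma_i is the number of students whose reference index
    i' + ceil((n - i') * p / 100) equals i.
    """
    references = Counter(
        i + math.ceil(Fraction(n - i) * Fraction(p) / 100)
        for i in range(1, n + 1)
    )
    return [references.get(i, 0) for i in range(1, n + 1)]
```

For n = 10 and p = 75 the sketch gives zero for the first seven indexes, then 2, 4 and 4, which agrees with Example 1.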


Unfortunately, when $(100 - p) \nmid 100$, the coefficients $\gamma_i$'s have
complex structure and we defer their study to future work.

Algorithm 1 (Percentile_Partitions) Optimal Partitioning
for maximizing Learning Gain - learning up to p-percentile
1: Input: Distinct descending scores $\{\theta_1, \theta_2, \ldots, \theta_N\}$, Percentile p,
   Number of groups k, Size of each partition n, k × n = N.
2: $G_1 = G_2 = \cdots = G_k = \emptyset$
3: $m \leftarrow 100/(100 - p)$
4: $q \leftarrow \lfloor n/m \rfloor$
5: $\bar{q} \leftarrow \lceil n/m \rceil$
6: if $\mathrm{mod}(n, m) \ne 0$
7:     $M \leftarrow \{\theta_{kq+1}, \theta_{kq+2}, \ldots, \theta_{k\bar{q}}\}$
8:     for $i \in \{1, 2, \ldots, k\}$
9:         $G_i \leftarrow G_i \cup M_i$
10:    end for
11: end if
12: $H1_{global} \leftarrow \{\theta_1, \theta_2, \ldots, \theta_{kq}\}$
13: $H2_{global} \leftarrow \{\theta_{k\bar{q}+1}, \theta_{k\bar{q}+2}, \ldots, \theta_N\}$
14: for $i \in \{1, 2, \ldots, k\}$
15:     $H1_{part} \leftarrow$ randomly sample $q$ scores from $H1_{global}$ without replacement.
16:     $H2_{part} \leftarrow$ randomly sample $(n - \bar{q})$ scores from $H2_{global}$ without replacement.
17:     $G_i \leftarrow G_i \cup H1_{part} \cup H2_{part}$
18: end for
19: return $\{G_1, G_2, \ldots, G_k\}$


4.1 Percentile Partitions 

Lemma 1 leads to our optimal partitioning algorithm, which
is shown in Algorithm 1. The algorithm first divides the
input ability scores into two or three sets, depending on
whether mod(n, 100/(100 − p)) is zero or not, respectively.
The first set, H1_global, consists of scores that contribute by
a factor of 100/(100 − p) to the learning gain. The second
set, M, if present, consists of scores that contribute by a
factor of mod(n, 100/(100 − p)). Finally, the third set, H2_global,
consists of scores that have zero contribution. These sets
correspond to the three different values of the γ coefficients.
They are such that H1_global ≥ M ≥ H2_global, where A ≥ B
means all elements of set A are greater than or equal
to any element of set B. For each of these sets, the
algorithm then creates k equal random partitions. These
partitions are then merged to create the final k partitions. The
example below illustrates the algorithm.


EXAMPLE 2. Consider a set of 20 students with ability
scores {θ_1, θ_2, ..., θ_20}, sorted in descending order. The
set is to be partitioned into four groups, each containing five
students. Each student can learn up to the score of the student
who is at the 66 2/3-percentile of the students above.

For p = 66 2/3 and n = 5, we have m = 3, q = 1, and q̄ = 2.
The algorithm breaks the scores into three sets:

H1_global = {θ_1, θ_2, θ_3, θ_4}

M = {θ_5, θ_6, θ_7, θ_8}

H2_global = {θ_9, θ_10, θ_11, θ_12, θ_13, θ_14, θ_15, θ_16, θ_17, θ_18, θ_19, θ_20}

For each set, four equal-sized random partitions are created,
which are then merged to create four groups:

G_1 = {θ_3} ∪ {θ_6} ∪ {θ_17, θ_10, θ_15}

G_2 = {θ_1} ∪ {θ_7} ∪ {θ_19, θ_16, θ_9}

G_3 = {θ_2} ∪ {θ_5} ∪ {θ_13, θ_18, θ_12}

G_4 = {θ_4} ∪ {θ_8} ∪ {θ_14, θ_20, θ_11}

Note: There are many equally good ways of partitioning
H1_global, M, and H2_global. The above is just one of them.
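The partitioning procedure of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative reading of the algorithm rather than the authors' implementation: it assumes that N is a multiple of k and that 100/(100 − p) is an integer, and it realizes "sampling without replacement" by shuffling each of the three sets and cutting it into k equal slices. The function name, the `seed` parameter, and the helper `part` are hypothetical.

```python
import random
from fractions import Fraction


def percentile_partitions(scores, p, k, seed=None):
    """Split scores into k equal groups using the three-set scheme of Algorithm 1."""
    rng = random.Random(seed)
    total = len(scores)
    if total % k != 0:
        raise ValueError("the number of scores must be a multiple of k")
    n = total // k                                  # size of each group
    period = Fraction(100) / (100 - Fraction(p))    # m = 100 / (100 - p)
    if period.denominator != 1:
        raise ValueError("100 / (100 - p) must be an integer")
    m = int(period)
    q = n // m                                      # floor(n / m)
    q_bar = -(-n // m)                              # ceil(n / m)

    ordered = sorted(scores, reverse=True)
    h1 = ordered[: k * q]                           # scores with coefficient m
    middle = ordered[k * q : k * q_bar]             # coefficient mod(n, m); empty if that is 0
    h2 = ordered[k * q_bar :]                       # scores with coefficient 0

    # Shuffling a block and slicing it into k equal parts is one way to draw
    # equal-sized random partitions without replacement.
    for block in (h1, middle, h2):
        rng.shuffle(block)

    def part(block, i):
        size = len(block) // k
        return block[i * size : (i + 1) * size]

    return [part(h1, i) + part(middle, i) + part(h2, i) for i in range(k)]


# Setting of Example 2: 20 students, four groups, p = 66 2/3.
groups = percentile_partitions(list(range(20, 0, -1)), Fraction(200, 3), k=4, seed=0)
for group in groups:
    print(group)
```

With the parameters of Example 2, every group receives one of the four highest scores, one of the next four, and three of the remaining twelve. Which particular scores land together depends on the random seed.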


4.2 Proof of Theorem 1 

Clearly, if the input scores are already in descending
order, the time complexity of Algorithm 1 is O(N). If
the input scores are unsorted, then the extra sorting step
makes the complexity O(N log N).


The optimality of the algorithm follows from the structure
in the values taken by the γ coefficients. Before proceeding
further, we state the following lemma:


LEMMA 2. For given ordered sets of real numbers, A =
{a_1, a_2, ..., a_n} and B = {b_1, b_2, ..., b_n}, the quantity
∑ (a · b), s.t. each a ∈ A and b ∈ B is used exactly
once, is maximized if the elements are chosen in a manner
such that the product of elements at the same index from A
and B is taken.
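Lemma 2 is an instance of the classical rearrangement inequality, and for small inputs it can be verified by brute force. The sketch below is only an illustration: it tries every pairing of two short lists sorted in the same order and confirms that the same-index pairing attains the maximum. The example values and the function name are hypothetical.

```python
from itertools import permutations


def best_pairing_value(a, b):
    """Maximum of sum(a_i * b_sigma(i)) over all permutations sigma of b."""
    return max(sum(x * y for x, y in zip(a, perm)) for perm in permutations(b))


# Both lists are sorted in descending order, as in the setting of the lemma.
scores = [9, 7, 4, 1]          # hypothetical ability scores
coefficients = [4, 2, 0, 0]    # hypothetical coefficients with m = 4 and mod(n, m) = 2

same_index = sum(x * y for x, y in zip(scores, coefficients))
print(same_index, best_pairing_value(scores, coefficients))   # 50 50
```

The exhaustive search grows factorially with the list length, so it is useful only as a sanity check on toy inputs.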


Now, according to Lemma 1, γ_i can take only one of
three values, and these are ordered as
100/(100 − p) > mod(n, 100/(100 − p)) > 0. The partitions
created by the algorithm satisfy H1_global ≥ M ≥ H2_global.
Thus, in light of Lemma 2, it is easy to observe that our
objective is maximized, as the sets of students with higher (lower)
scores get mapped to the higher (lower) coefficients. Moreover,
the random perturbations within H1_global, M, or H2_global
do not affect the gain value, as all the scores from a set are
involved in a product with the same γ value.


5. EXPERIMENTS 


5.1 Datasets 

1. SSC Scores (Normal distribution): Staff Selection 
Commission - Combined Graduate Level Examination (SSC- 
CGL) is conducted all across India to recruit employees for 
various departments of Government of India. The scores of 
candidates for the 2016 examination, categorized into differ- 
ent regions of the country, are available at ssc.nic.in. The 
distribution of scores in every region is close to normal. We 
took the scores from the North Western (SSC-NWR) region 
that exhibits the largest variance. 


2. GATE Scores (Log-Normal distribution): In In- 
dia, Graduate Aptitude Test in Engineering (GATE) is con- 
ducted every year to test the competency of undergradu- 
ate students in various engineering disciplines. We took 
the available scores from the year 2016. We experimented with
scores from Mech. (GATE-MBE), which has the largest variance.


3. StkXchg UpVotes (Pareto distribution): On the 
Stack Exchange platform, users can ask and answer ques- 
tions on various topics. Additionally, they can up-vote or 
down-vote a question. The number of up-votes a user re- 
ceives is an indicative measure of his level of expertise. Pareto 
distribution fitted the data for the active users having at 
least one up-vote. The Stack Exchange data dump is available
from archive.org/details/stackexchange. We take the data
for Stack Overflow, which exhibits the lowest skew in its distribution.


5.2 Algorithms 


In addition to Percentile_Partitions, we consider two algo- 
rithms that correspond to the strategies currently prevalent 
in practice: Stratified and Random. 
1. Stratified: This algorithm puts in each group those students
who exhibit similar ability. This grouping represents
the practice of homogeneous or ability-based grouping.


2. Random: Students are assigned to groups randomly. 
This method corresponds to the practice of heterogeneous 
or diversity-based grouping. 
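The two baselines can be sketched as follows. The paper does not spell out their implementation, so this is one plausible reading: Stratified is assumed to sort the scores and cut them into consecutive blocks, and Random is assumed to shuffle the scores before cutting. Both sketches assume that the number of scores is a multiple of k. The function names and the `seed` parameter are hypothetical.

```python
import random


def stratified_groups(scores, k):
    """Ability-based grouping: consecutive blocks of the sorted scores."""
    ordered = sorted(scores, reverse=True)
    n = len(ordered) // k
    return [ordered[i * n : (i + 1) * n] for i in range(k)]


def random_groups(scores, k, seed=None):
    """Diversity-based grouping: blocks of a random shuffle of the scores."""
    shuffled = list(scores)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled) // k
    return [shuffled[i * n : (i + 1) * n] for i in range(k)]


print(stratified_groups([5, 1, 4, 2, 3, 6], k=2))   # [[6, 5, 4], [3, 2, 1]]
```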


5.3 Set Up 


We conducted our experiments setting the number of stu- 
dents, N, to 1024. We varied the number of groups, k, over 
{2, 4, 8, ..., 512}, and the reference percentile point p over
{50, 66 2/3, 75, 80, 90, 95, 98, 99}. Thus, for each dataset, we
randomly sample 1024 scores and generate the groups for 
different combinations of k and p values. In order to have 
tight confidence intervals, we repeat this exercise 30 times 
each and report average learning gain. 
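The parameter grid described above can be enumerated directly, which also makes it easy to confirm that every chosen percentile yields an integer period 100/(100 − p), as the optimality result requires. The snippet is a sketch under the stated settings (N = 1024, k a power of two from 2 to 512). The variable names are hypothetical.

```python
from fractions import Fraction

N = 1024
group_counts = [2 ** e for e in range(1, 10)]            # 2, 4, ..., 512
percentiles = [Fraction(50), Fraction(200, 3), Fraction(75), Fraction(80),
               Fraction(90), Fraction(95), Fraction(98), Fraction(99)]

# Period m = 100 / (100 - p) for each reference percentile.
periods = {p: Fraction(100) / (100 - p) for p in percentiles}

# One experimental configuration per (k, n, p) triple, with n = N / k.
grid = [(k, N // k, p) for k in group_counts for p in percentiles]

print([int(periods[p]) for p in percentiles])   # [2, 3, 4, 5, 10, 20, 50, 100]
print(len(grid))                                # 72
```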


For the groups generated by Percentile_Partitions, we com- 
pute learning gain using Eq. 3. When applying Stratified 
or Random to a dataset, we generate groups only once but 
compute gain using the appropriate parameter value for p. 


We also study the group structures generated by different 
algorithms. By the structure of a group, we mean the dis- 
tribution of scores in the group. Although we run each al- 
gorithm 30 times, we only show the structure of the group 
generated by the first run. 


5.4 Results 


Fig. 2 shows the learning gain as the reference percentile 
value, p, is varied for different algorithms on various datasets. 
We show the plots for three values for the number of groups,
k ∈ {128, 32, 8} (and the corresponding group sizes, n ∈
{8, 32, 128}). Fig. 3 shows the learning gain as the number
of groups, k, is varied. We show the plots for two percentile
values, p ∈ {75, 90}. Fig. 4 shows the group structures
generated by different algorithms. We show the structures 
for groups of size, n = 8, and for the reference percentile, 
p = 75. We alert the reader that different scales have been 
used for Y-axis in Figs. 2-3 and a logarithmic scale has been 
employed for X-axis in Fig. 3 for the sake of clarity. 


We see that the overall behavior of different algorithms remains
similar across different group sizes and reference percentile
values. Clearly, Percentile_Partitions consistently outperforms
the other algorithms, which corroborates its theoretical
optimality. The following additional observations are
noteworthy:


• With increasing value of p, total learning gain increases
superlinearly (Fig. 2). This is because the extent of learning
gain for each student increases. The gain plateaus for
small groups because, beyond some percentile value, all
students improve up to the score of the same highest-ability
student. Then, it does not matter whether the reference percentile
is at 90 or 95.


• The advantage of Percentile_Partitions over Random is
more pronounced when the number of students in a group
is in a more realistic range of 32 or less (Fig. 3). When the
number of groups is small and each group is large,
Percentile_Partitions assigns very many students randomly,
and therefore the group structure and gain produced by
it become similar to those of Random.


• The learning gain is worst with the stratified strategy.
Fig. 4 shows that this strategy produces groups in which
the students have similar scores. Therefore, the improvements
from peer interactions are small. Fig. 4 also shows
that the p-percentile value of every group produced by
Percentile_Partitions is higher than the global p-percentile
value of all the undivided scores. However, this pattern
is not true for Random. Some groups generated by Random
have their p-percentile to the extreme right of the global
p-percentile. The scores in between the two p-percentiles in
such groups do not contribute to the total gain. But then
some other groups end up having smaller scores above the
p-percentile, which leads to smaller additions to the total gain.
Hence, the superior performance of Percentile_Partitions.
Figure 4: Group structure generated by different algorithms for groups of size 8. Each row in the plots corresponds to a particular
group and there is a dot for each ability score in that group. The p-percentile score for each group is plotted in black. The vertical red
line shows the global p-percentile score. The groups are numbered according to the order in which they are generated. Only for
Percentile_Partitions, the p-percentile score for every group is higher than the global p-percentile value.


6. SUMMARY 


We investigated the important educational data mining prob- 
lem of how to group students in a class to maximize their 
learning gains from peer interactions. We worked with a 
general learning gain function in which every student is able 
to increase his ability score up to the score of the student 
who is at p-percentile amongst his higher ability peers. We 
gave an algorithm which is provably optimal for maximizing
learning gain when the value of p is such that 100/(100 − p) is
integer valued. We also studied the performance characteristics
of the proposed algorithm using real-life datasets that
corroborated the theoretical analysis and showed its supe- 
riority over the current approaches. Surprisingly, the time 
complexity of optimally grouping N students using our al- 
gorithm is only O(N log N). 
ABSTRACT


Automatic assessment of dialogic properties of classroom 
discourse would benefit several widespread classroom ob- 
servation protocols. However, in classrooms with low in- 
cidences of dialogic discourse, assessment can be highly bi- 
ased against detecting dialogic properties. In this paper, 
we present an approach to addressing this imbalanced class 
problem. Rather than perform classifications at the utter- 
ance level, we aggregate feature vectors to classify propor- 
tions of dialogic properties at the class-session level and 
achieve a moderate correlation with actual proportions, 

r(130) = .50, p < .001, CI95 [.36, .61]. We show that this
approach outperforms aggregating utterance-level classifications,
r(130) = .27, p = .001, CI95 [.11, .43], is stable for
both low and high dialogic classrooms, and is stable across 
both automatic speech recognition and human transcripts. 


Keywords 
dialogic instruction, questions, authenticity, machine learn- 
ing, imbalanced classes 


1. INTRODUCTION 


Classroom observation for measuring teaching effectiveness 
is currently used in 47 states [1]. Simply stated, classroom 
observation involves a trained evaluator watching how a 
class is taught and using a rubric to score the teacher’s per- 
formance. The widespread use of classroom observation is 
based on previous research which indicates that instructional 
quality has a greater impact on student achievement than 
class size, teacher experience, or teacher graduate education 
[16]. Beyond such research findings, classroom observation 
is also driven by the teacher accountability era coinciding 
with the passage of the federal No Child Left Behind Act,
which mandated annual testing of students by all states. In 
this highly politicized environment, classroom observation 
is increasingly being used to determine teacher’s salary and 
tenure. 


Curiously, given the high stakes associated with classroom 
observation, the majority of research linking instructional 
quality to student achievement over the past several decades 
has been correlational only. However, there has been an increasing
interest in randomized controlled trials. One recent
randomized trial is the multi-year Measures of Effective
Teaching (MET), which tracked approximately 3,000 teach- 
ers in seven states [4]. In year 1, MET researchers built 
predictive models of teaching effectiveness, and in year 2, 
teachers were randomly assigned to new classrooms to test 
the predictive models from year 1. Major MET findings were 
that teaching effectiveness measured via classroom observa- 
tion protocols correlated with achievement gains and that 
question asking behavior was a key component of variability 
in teaching quality [11]. 


Although instructional quality is linked to achievement, the 
current practice of assessing instructional quality through 
classroom observation is logistically complex and expensive, 
requiring observer rubrics, observer training, and contin- 
uous assessment to maintain a pool of qualified observers 
[2]. To address these practical challenges, our work has fo- 
cused on the automated assessment of classroom discourse, 
with a particular emphasis on measuring dialogic questions 
in classrooms. Our approach is to automate an existing, 
fine grained classroom observation protocol that focuses on 
dialogic questions, known as the Classroom Language Assessment
System¹ [13]. Unlike the classroom observation
protocols used in the MET study, in which an observer 
makes rubric-based judgments approximately every 10 min- 
utes, CLASS uses fine-grained coding at the question level, 
creating suitably detailed labeled data for machine learning 
purposes. 


¹CLASS denotes the CLASS created by Nystrand and colleagues,
as opposed to the CLASS used in the MET study.
The dialogic instruction measured by CLASS is characterized
by open-ended discussion and the exchange of ideas (cf.
[3]), which in turn are characterized by questions that truly 
seek information (authentic questions) and which incorpo- 
rate ideas from the student (questions with uptake). For 
example, “How did you feel by the end of the story?” is an 
authentic question because there is no pre-scripted response, 
and a follow-on question “Why do you think that is?” has 
uptake because “that” refers to the student’s previous reply. 
As is clear in these examples, dialogic properties are con- 
textualized by the discourse such that the antecedents and 
consequents of the question shape whether a question is au- 
thentic or has uptake. Previous research using CLASS has 
shown that authenticity and uptake are significant predic- 
tors of student achievement [10, 9, 14]. 


Our project, which we call CLASS 5, seeks to fully auto- 
mate classroom observations under the CLASS protocol. In 
our work, we have used archival data collected in previous 
CLASS projects, containing human transcripts of dialogic 
questions, as well as new data using automatic speech recog- 
nition (ASR) of teacher speech. Models built with archival 
human transcript data are as effective at classifying authen- 
ticity and uptake as humans on isolated questions [18]. How- 
ever, as we began to analyze the new CLASS 5 data, we re- 
alized that there were two serious limitations undermining 
our existing models. First, the archival data used in pre- 
vious work [18, 17] contained only transcripts of questions, 
and even these did not represent all questions but a subset of 
questions that were instructional, and so excluded rhetorical 
questions, procedural questions, and discourse management 
questions [13]. In the archival data, approximately 50% of 
the questions were coded as authentic questions. In con- 
trast, the new CLASS 5 data included all questions and non- 
questions, i.e. all utterances, from which authentic questions 
must be detected. Secondly, in the CLASS 5 data, the base 
rates for dialogic properties were dramatically lower than in 
previous samples. For example, authentic questions in our 
new data collection constituted about 30% of instructional 
questions compared to approximately 50% of instructional 
questions in the archival data; moreover, authentic questions 
in our new data constituted only about 3% of all utterances. 
Therefore, to be robust in detecting dialogic properties across
samples, our models must be able to deal adequately with 
imbalanced classes. 


The so-called “class imbalance problem” is well known in the 
data mining community, and has been proposed as one of 
data mining’s top 10 challenging problems [20]. The essence 
of the problem is that a classifier can maximize accuracy by 
always selecting the majority class and that this strategy, 
typically considered as a baseline for performance, becomes 
increasingly hard to beat as the majority class distribution 
approaches 100%. A review of the class imbalance problem 
describes three major approaches for addressing it [8]. First, 
algorithmic approaches may be used to bias learning towards 
the minority class. Secondly, preprocessing methods may 
change the class distribution before learning occurs, either 
by undersampling the majority class or oversampling the 
minority class. Thirdly, cost-sensitive approaches may be
used to assign higher costs, or weights, to minority class 
errors, such that the learning algorithm tries to minimize 
the total cost. 
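The strength of the majority-class baseline is easy to quantify. The arithmetic below is an illustrative sketch that uses the utterance counts reported for the CLASS 5 data in Section 2.1 (45,044 utterances, of which 1,282 are authentic questions). The variable names are hypothetical.

```python
total_utterances = 45_044
authentic_questions = 1_282

# A classifier that always answers "not authentic" is correct on every
# majority-class utterance and wrong on every authentic question.
correct = total_utterances - authentic_questions
majority_accuracy = correct / total_utterances
minority_recall = 0 / authentic_questions   # no authentic question is ever detected

print(f"accuracy = {majority_accuracy:.3f}, minority recall = {minority_recall:.1f}")
```

The baseline is correct on roughly 97% of utterances while detecting no authentic questions at all, which is why accuracy alone is a poor yardstick under this degree of imbalance.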
In this paper, we present another method for addressing the
class imbalance problem, which is to transform the problem 
into a different problem that is easier to handle. Specifi- 
cally, we explore the consequences of shifting from classifiers 
that classify utterances as authentic questions to classifiers 
that classify the proportion of authentic questions in a class 
session. As will be shown in the remainder of the paper, 
this problem transformation outperforms aggregating utter- 
ance level classifications, is stable for both low and high dia- 
logic classrooms, and is stable across both automatic speech 
recognition and human transcripts. 


2. METHOD 


2.1 Data sets 

CLASS 5 data. New data for the CLASS 5 project were 
collected between January 2014 and May 2016 at seven 
schools in rural Wisconsin. Observations for 132 class ses- 
sions taught by 14 different teachers were manually coded 
using the CLASS system, and audio was simultaneously 
recorded. Both teacher and school identifiers were preserved 
with the data. Given the logistical constraints of individ- 
ual microphones for each student, the recording instrumen- 
tation instead focused on high quality teacher audio suit- 
able for ASR that was recorded using a wireless microphone 
headset. Classroom audio, which included both teacher and 
student speech, was recorded from a stationary boundary 
microphone, and was not of sufficient quality to be used for 
ASR; however, it is useful for marking when students speak. 
The teacher audio was later automatically segmented into 
utterances and then submitted to a speech recognition ser- 
vice [6]. Thus this dataset differs from the archival data (see 
below) in that the transcripts are provided by ASR with 
its accompanying errors, only teacher speech is transcribed, 
and the transcripts contain all utterances rather than just 
instructional questions. The data contained 45,044 utterances,
of which 1,282 were authentic questions (3% of utterances;
30% of instructional questions) and 290 were questions
with uptake (about 0.6% of utterances; about 7% of instructional
questions). Authenticity and uptake are even more highly
related in this data set than in the archival data since only 
5 questions have uptake without authenticity. Given the 
small number of observations of uptake and the finding that 
virtually all questions with uptake are also authentic, we 
primarily focused on detecting authenticity. 


Archival data. The archival data was collected during the 
Partnership for Literacy Study (Partnership), a study of pro- 
fessional development, instruction, and literacy outcomes in 
middle school English and language arts classrooms. The 
Partnership collected data from 7th- and 8th-grade English 
and language arts teachers in Wisconsin and New York from 
2001 to 2003. Over that two-year period, 119 classes in 
21 schools were observed twice in the fall and twice in the 
spring. Teacher identifiers were not embedded in the CLASS 
data files, and out of 119 teachers only 70 could be unequivo- 
cally matched to data files. However, school identifiers were 
directly embedded in data files. Classroom observations for 
Partnership were also conducted using the CLASS annota- 
tion system [13]. During this process instructional ques- 
tions were transcribed, and the transcriptions were mostly 
accurate but not verbatim. Reliability studies using CLASS 
indicate that raters agree on question properties approxi- 
mately 80% of the time, with observation-level inter-rater 
correlations averaging approximately .95 [14]. After removing
questions with partially incomplete annotations, 25,711
instructional questions remained for use in our analyses, of 
which 12,862 were authentic questions (50%) and 5,489 were 
questions with uptake (22%). Authenticity and uptake were 
highly related: only 593 (2%) questions had uptake without 
authenticity. 


2.2 Features 

In early work, we established that word and part-of-speech 
features that are useful for classifying types of questions [15] 
were also useful for predicting dialogic question properties 
like authenticity and uptake [18, 17]. In the present work 
we have extended these 36 predictive features to include 
features obtained through syntactic and discourse parsing 
[12, 19]. At the word level, these new features include 45 
part-of-speech tags as well as named entity type, which sub- 
divides real world objects described by proper nouns into 
13 classes including PERSON, LOCATION, and DATE. At 
the sentence level, the features include 47 syntactic depen- 
dencies like subject, agent, direct object, or indirect object. 
And at the discourse level, the features include 18 discourse 
relations including contrast, elaboration, and topic-change, 
as well as features for joint, nucleus, and satellite elemen- 
tary discourse units. Because the discourse parse returns 
a tree of elementary discourse units, the discourse features 
were mapped to the sentence level by summing the discourse 
relations, satellite, joint, and nucleus features that occur 
in each elementary discourse unit composing the sentence. 
Anaphora resolution was converted into four features, including
the number of coreference chains in an utterance extending
into future sentences, the sum of those chains' lengths,
and the same features in the backwards direction. In other 
words, the anaphora features capture how well a sentence 
was connected to other sentences in both directions. While 
all features were encoded at the sentence/utterance level (i.e. 
a count of the feature in the utterance), the 36 question fea- 
tures used in previous work were additionally encoded as oc- 
curring at either the first token or after the first token. For 
example, if a definition keyword feature occurred in the first 
token, then that would be recorded as a single count in the 
corresponding overall feature and the first token feature, but 
not in the corresponding after the first token feature. Ad- 
ditionally, the named entity PERSON feature was encoded 
with first token and last token variants based on the obser- 
vation that questions addressed to students typically use the 
name at the beginning or end of an utterance if at all. With 
the positional variants, there were 242 linguistic features in 
our models that span word, sentence, and discourse levels. 


To generate these features we used the CLU processor, which 
contains syntactic and discourse parsers [19]. Because dis- 
course parsing requires a discourse context, utterances for 
each classroom observation were grouped into separate files 
before parsing. The parsers were configured with a maxi- 
mum sentence length of 120 words, which was empirically 
determined by observing the lengths of a subsample of ut- 
terances. Parses for each class-level file were converted into 
utterance level features and aggregated into a 242-dimension 
feature vector where the value at each position was the fre- 
quency count of a particular feature in that utterance. Mod- 
els built at the question level for archival data or utterance 
level for new data used these 242-dimension feature vectors. 
Models built at the class-session level used these features
but summed them over all questions (Partnership) or utter- 
ances (CLASS 5) in a given class. Models at the class-session 
level additionally added the means and standard deviations 
of these summed feature vectors, for a total of 726 features. 
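One way to read this aggregation is sketched below: for each of the 242 features, the session-level vector stores the sum, the mean, and the standard deviation over the utterances of the session, giving 3 × 242 = 726 values. This is an assumed interpretation, not the authors' code. In particular, the use of the population standard deviation and the ordering of the three blocks are assumptions, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def session_features(utterance_vectors):
    """Concatenate per-feature sums, means, and standard deviations for one session."""
    columns = list(zip(*utterance_vectors))       # one tuple of counts per feature
    sums = [sum(col) for col in columns]
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]       # population SD (assumption)
    return sums + means + stds


# Toy session with two utterances and two features.
print(session_features([[1, 0], [3, 2]]))         # [4, 2, 2, 1, 1.0, 1.0]
```

Applied to utterance vectors of length 242, the function returns a vector of length 726, the figure given in the text.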


2.3 Model training 


Cross validation. We used cross validation such that a 
given teacher would not appear in both the training and test- 
ing folds, in order to study generalizability to new teachers. 
For the CLASS 5 data, this was achieved using leave-one- 
teacher-out cross validation. For the archival Partnership 
data, the mapping between teachers and data files was in- 
complete and so the mapping between schools and data files 
was used instead. This leave-one-school-out cross validation 
makes the assumption that a teacher did not transfer be- 
tween schools during the study (a likely assumption) and 
in a sense is even more conservative than leave-one-teacher- 
out validation because it controls for similarities shared by 
teachers at the same school. Ideally the same cross vali- 
dation technique would be used for both data sets, but for 
CLASS 5 data there aren’t enough schools (2) and for the 
Partnership data the teacher identifiers are incomplete. 
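The grouping constraint behind both validation schemes can be sketched generically: each fold holds out every observation belonging to one group (a teacher for CLASS 5, a school for Partnership) and trains on the rest. The generator below is an illustrative sketch. The function name and the toy identifiers are hypothetical.

```python
def leave_one_group_out(group_ids):
    """Yield (held_out_group, train_indices, test_indices) for each distinct group."""
    for held_out in sorted(set(group_ids)):
        train = [i for i, g in enumerate(group_ids) if g != held_out]
        test = [i for i, g in enumerate(group_ids) if g == held_out]
        yield held_out, train, test


# One identifier per class session; the labels are made up for illustration.
teachers = ["t1", "t1", "t2", "t3", "t2"]
for held_out, train, test in leave_one_group_out(teachers):
    print(held_out, train, test)
```

Because the split is made on the group identifier rather than on individual sessions, no teacher (or school) contributes data to both the training and the testing side of a fold.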


Models. Different models were used depending on the na- 
ture of the task and the class imbalance. For question-level 
authenticity prediction in the archival Partnership data, 
where classes are balanced, a J48 decision tree was used. 
J48 models were chosen because of their previous perfor- 
mance on this task and data set [18]. For utterance-level 
authenticity prediction in the new CLASS 5 data, where 
classes are highly imbalanced, SMOTEBoost was selected 
[5]. SMOTEBoost combines oversampling of the minority
class by synthesizing new exemplars (SMOTE) with boost- 
ing, which builds a serial ensemble of models such that each 
successive model increases the weight, or focus, to instances 
misclassified in the previous model. SMOTEBoost applies 
SMOTE in each of these successive models in order to im- 
prove accuracy over the minority class, and evidence sug- 
gests it is one of the best all-purpose algorithms for imbal- 
anced problems, though not necessarily the fastest [8]. Sev- 
eral other algorithms were evaluated on this task, including 
k-nearest neighbors, random forests, various cost-sensitive 
classifiers, and various ensembles, but SMOTEBoost had 
the best utterance-level performance. For class-level au- 
thenticity prediction (for both Partnership and CLASS 5 
data), M5P model trees, which are decision trees with re- 
gression functions at the leaves [7], were used to predict the 
proportion of authentic questions in the class period. As 
a comparison to the class-level models, we aggregated over 
the question- and utterance-level classifications to calculate 
a proportion score at the class level. 
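The comparison procedure can be sketched as follows: binary utterance-level classifications are averaged within each session to obtain a predicted proportion, and the predicted proportions are then correlated with the actual ones. This is a minimal illustration rather than the authors' pipeline. It assumes that the proportion is taken over all classified items in a session, and the function names and toy data are hypothetical.

```python
from math import sqrt


def session_proportion(binary_predictions):
    """Share of items in one session that were classified as authentic (1)."""
    return sum(binary_predictions) / len(binary_predictions)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data: three sessions with four classified utterances each.
sessions = [[1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]]
predicted = [session_proportion(s) for s in sessions]   # [0.25, 0.5, 0.75]
actual = [0.2, 0.5, 0.9]                                # made-up true proportions
print(predicted, pearson_r(predicted, actual))
```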


3. RESULTS & DISCUSSION 


3.1 Proportion models for imbalanced data 

Our first comparison was between class-session level pro- 
portion models and aggregated utterance level classifica- 
tions for the new CLASS 5 data where authenticity was 
very rare. An M5P model trained to predict the proportion
of authentic questions per class made predictions that
had a significant correlation with the actual proportions,
r(130) = .50, p < .001, CI95 [.36, .61]. A SMOTEBoost
Figure 1: M5P session-level proportion predictions
on the CLASS 5 data set.


model trained to predict the authenticity of utterances and 
whose predictions were aggregated to obtain class-session 
level proportions made predictions that had a significant
correlation with actual proportions, r(130) = .27, p = .001,
CI95 [.11, .43]. However, these two correlations were significantly
different, t(258) = 2.42, p = .017. These results suggest
that class-session level proportion predictions are more
accurate than aggregating predictions from utterance level 
models. 


Scatterplots of the actual vs. predicted proportion of au- 
thentic questions in the new CLASS 5 data are shown in 
Figures 1 and 2. Perhaps the major difference between 
these two scatterplots is the relationship between predicted 
and authentic proportions for values near zero. For the ag- 
gregated utterance-level predictions generated by SMOTE- 
Boost, the scatterplot in Figure 2 shows a large vertical col- 
umn of predictions above zero, indicating that for values 
near zero the classifier is overestimating the true occurrence 
of authentic questions. Conversely in Figure 1, predictions 
at zero are more tightly clustered. 


Based on these results, it appears that session-level propor- 
tion models like M5P are more forgiving of the imbalanced 
classes than are utterance-level models like SMOTEBoost. 
There are two plausible explanations for why this might be. 
First, the session-level models are predicting a continuous 
number between 0 and 1 rather than making crisp binary 
judgments as in the case for the utterance-level models. 
Continuous predictions more closely match the model’s in- 
ternal probability, as opposed to a binary judgment where 
the binary prediction is the same irrespective of how far the 
model’s internal probability is from the threshold, so long as 
it is on the same side of the threshold. Secondly, utterance- 
level models do not take advantage of the probability of a 
previous utterance’s authenticity in determining the current 
Figure 2: SMOTEBoost utterance-level predictions
aggregated to session-level proportions on the
CLASS 5 data set.


utterance’s authenticity, whereas the session-level models 
are accumulating all of this weak evidence before rendering a
proportion authenticity prediction. Based on this reasoning, 
an additional comparison of interest would be to take the 
utterance-level prediction probabilities and aggregate over 
them instead of the binary classifications. Unfortunately 
in the case of SMOTEBoost, these probabilities are within 
10~° of zero and one, so the results are no different than 
aggregating over class predictions. 
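The difference between aggregating binary decisions and aggregating probabilities can be illustrated with made-up numbers. In the sketch below, graded probabilities lose information when thresholded, whereas probabilities that are already saturated near zero and one give essentially the same session-level proportion either way. All values, the 0.5 threshold, and the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def aggregate_binary(probabilities, threshold=0.5):
    """Session proportion from crisp decisions: threshold first, then average."""
    return mean([1 if prob >= threshold else 0 for prob in probabilities])


def aggregate_probabilities(probabilities):
    """Session proportion from soft evidence: average the probabilities directly."""
    return mean(probabilities)


graded = [0.45, 0.40, 0.55, 0.10]            # informative, unsaturated probabilities
saturated = [1e-9, 1e-9, 1 - 1e-9, 1e-9]     # probabilities pinned near 0 and 1

print(aggregate_binary(graded), aggregate_probabilities(graded))
print(aggregate_binary(saturated), aggregate_probabilities(saturated))
```

In the graded case the two aggregates differ (about 0.25 versus about 0.375), while in the saturated case they agree to within rounding, which mirrors the observation made above about the SMOTEBoost probabilities.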


3.2 Proportion model stability 

To demonstrate model stability we undertook two comparisons.
First, predictions of an M5P model for the Partnership
data trained to predict the proportion of authentic questions
per class session were significantly correlated with the
actual proportions, r(426) = .42, p < .001, CI95 [.34, .50].
This correlation is remarkably similar to the 0.5 correlation 
obtained for the new CLASS 5 data. The similarity in cor- 
relations is particularly noteworthy given the differences be- 
tween data sets: for CLASS 5, the classifier is operating 
over ASR transcribed utterances where authentic questions 
are 3% of the total data, but for the Partnership data, the 
classifier is operating over human transcribed instructional 
questions where authentic questions are 50% of the total 
data. 


Second, a J48 model for the Partnership data, trained to predict
the authenticity of utterances and whose predictions were
aggregated to class-session-level proportions, made predictions
that were correlated with actual proportions, r(426) = .44,
p < .001, CI95 [.36, .51]. These two correlations were not
significantly different, t(870) = .37, p = .71. Scatterplots of
the actual vs. predicted proportion of authentic questions in the
Partnership data in Figures 3 and 4 further illustrate the
similarities of these predictions. The equivalence between
utterance- and session-level models for the Partnership data
(shown in Figures 3 and 4) and the lack of equivalence between
utterance- and session-level models for the new CLASS 5 data
(shown in Figures 1 and 2) serve to further illustrate the
enhancement to predictive stability that comes from using
session-level models for this task. When the classes are
relatively balanced, as in the case of the Partnership data,
there is no difference between aggregating utterance-level
predictions and session-level predictions. However, when the
classes are imbalanced, as in the case of the new CLASS 5 data,
the differences are significant and favor the session-level
model.


Figure 3: M5P session-level proportion predictions on the
Partnership data set.
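
The confidence intervals reported for the two Partnership correlations are consistent with the standard Fisher z-transformation. The sketch below reproduces them under that assumption; the text does not state which method was actually used, and the function name is hypothetical. The sample size n = 428 follows from the reported degrees of freedom, r(426).

```python
import math


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson correlation via Fisher's z."""
    z = math.atanh(r)               # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)     # standard error on the z scale
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper


n = 426 + 2  # degrees of freedom of r are n - 2
for r in (0.42, 0.44):
    lower, upper = fisher_ci(r, n)
    print(r, round(lower, 2), round(upper, 2))
```

The heavy overlap of the two intervals agrees with the reported finding that the correlations do not differ significantly.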


4. DISCUSSION 


We have presented and validated a method for assessing 
classroom instructional quality based on authentic questions 
that is effective even when such questions are rare. Our 
approach transforms the problem of utterance-level authen- 
tic question classification into the problem of session-level 
regression predicting the proportion of authentic questions. 
This problem transformation outperforms aggregating utter- 
ance-level classifications when classes are imbalanced, is sta- 
ble for both low and high dialogic classrooms, and is stable 
across both automatic speech recognition and human tran- 
scripts. As such it is more appropriate for use in assessing 
classroom instructional quality across a wide range of dia- 
logic discourse, complementing previous work that has in- 
vestigated model generalization in different discourse com- 
munities [17]. Because question asking behavior of this type 
is a common component of the major classroom observation 
protocols in use today (e.g., those used in the MET study 
[11]), this research may potentially be used to help automate
various protocols in addition to the target protocol here, CLASS.
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Figure 4: J48 utterance-level predictions aggregated 
to session-level proportions on the Partnership data 
set. 


Because many major classroom observation protocols call 
for judgments of quality approximately every 10 minutes, 
session-level proportion predictions are not too dissimilar 
from current practice. A useful point for future research 
would be to obtain data coded with these protocols in ad- 
dition to the speech data we used, subdivide the data into 
10 minute bins, and then calculate accuracy. On the other 
hand, the CLASS protocol is much more fine grained, and 
the current approach sacrifices the utterance-level resolution 
CLASS specifies for robustness. From a teacher professional 
development perspective, fine grained annotations are more 
useful because they can be replayed to the teacher to high- 
light particularly effective portions of the class. Our session- 
level approach in its present form appears to be less useful 
for professional development. 


An avenue for future work would be to combine session- 
level and utterance-level models. For example, a session- 
level model could first be applied to the data, generating 
a session-level prediction variable, and then that variable 
could be used as a feature in an utterance-level model. Pre- 
sumably this would be used by the model as an intercept to 
adjust the baseline probability of authenticity for all utter- 
ances in that session. Of course the session- and utterance- 
level processes could also be jointly modeled, e.g. using a 
hierarchical Bayesian approach. 


Finally, we raise the question of why authentic questions were
rarer in our new CLASS 5 data collected from 2014-2016 compared
to the archival Partnership data collected from 2001-2003. The
question is whether the low rate of authentic questions in our
new sample is something that can reasonably be expected to
recur, or whether it is the product of a relatively small,
homogeneous sample. Indeed, we find that some of the first
studies with CLASS found levels of authenticity between 10% and
30% [14], suggesting that the rate of authentic questions in our
new sample is in the normal range. The fact that rates as low as
10% have been observed serves as a warning and challenge to
future research. In our new CLASS 5 data, authenticity rates of
30% for instructional questions translated to 3% of utterances
being authentic. Presumably a 10% authenticity rate for
instructional questions would mean that only 1% of utterances
are authentic.
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ABSTRACT 


Because MOOCs suffer from high dropout rates, researchers are
trying to understand the reasons and mitigate the problem.
Focusing on this task, we employ a composite model to infer
learners' behavior in the coming weeks from their history logs
of learning activities, including interaction with video
lectures, participation in discussion forums, and performance on
assignments.


The prediction accuracy of our proposed model outperforms
related methods. In addition, we try combining the model with
suggested interventions, such as sending reminder emails to
at-risk learners. Future work, which is currently underway, will
evaluate its influence on mitigating the dropout rate.


Keywords 
MOOC; dropout; Stacked Sparse Autoencoder; RNN 


1. INTRODUCTION 


Recently, online education, whose landmark concept is the MOOC
(Massive Open Online Course), has become a new global craze,
giving rise to several MOOC platforms including edX, Coursera,
and Udacity. Because MOOCs leave learners free to choose when
and where they study, a large number of learners have benefited
from this new form of online learning. A typical MOOC lasts 6-12
weeks and attracts learners of diverse backgrounds and major
fields. Moreover, MOOC learners may have different intentions
and motivations, so they engage to different extents and leave
for various reasons.


Despite the increasing popularity of MOOCs, their extremely low
completion rate has been a concern from the beginning. Dropout
is regarded as one of the most critical problems of MOOCs.
Dropout refers to situations in which a student registers for a
course, watches course materials, or even attempts the quizzes,
but eventually quits without taking the final test. It has been
reported that the average completion rate of MOOCs is as low as
7 percent, ranging from 0.8 percent in Princeton's "History of
the World since 1300" to 19.2 percent in the "Functional
Programming Principles in Scala" course [7]. MOOC platforms thus
face a serious problem in their high learner dropout rates.


Identifying at-risk learners by predicting their dropout
probability has therefore become important and timely, because
early prediction can help instructors provide proper support to
those learners, sustain their interest in learning, and help
them keep a regular pace of study rather than cramming or
dropping out. To address this task, we focus on predicting
learners' states for the next two consecutive weeks. In
particular, we formulate the problem as a multi-class
classification problem and develop a Stacked Sparse Autoencoder
(SSAE)+Softmax model to solve it. Our model has several
advantages. First, it incorporates multiple features that
characterize learners' weekly engagement on the MOOC platform.
Second, it discovers correlations between the observed
explanatory features; the new compressed feature representation
produced by the SSAE performs better than the original features
as input to the classifiers. Third, the model considers both the
current and previous states to estimate the next states, which
makes it more flexible in modeling students' dynamics.


By training a model to identify at-risk students, we can ap- 
ply this model on online MOOC platforms, enabling it to 
calculate students’ at-risk-rate regularly and send emails to 
them automatically. Hopefully some of these at-risk stu- 
dents will continue their learning. 


We make contributions in this paper as follows: 


1. We employ different composite models that incorporate
multiple features to infer behavior in the coming weeks based on
the weekly history of learning data. The model is an end-to-end
neural network, which means it can be trained as a whole. Our
results indicate that the SSAE+Softmax model performs best,
consistently achieving a higher AUC score than the baseline SVM
model.


2. We try combining the model with suggested interventions, such
as sending reminder emails to at-risk learners. Although we do
not conduct real experiments with sending emails, the paper
proposes a preliminary framework for using the experimental
results to determine to whom reminder emails should be sent and
when to send them.


3. We explore to what extent each single feature influences
dropout probability and try to cluster dropout learners with the
k-means clustering algorithm, showing that features extracted
from course engagement are effective indicators of the behavior
pattern to which a low-performing learner belongs. Future work
will shed light on the relationships between learners' behavior
patterns and the reasons they quit the course.


The rest of this paper is organized as follows. Section 2 de- 
scribes the related work. Section 3 presents the description 
of the dataset and features derived from the dataset. Section 
4 introduces our model in detail. Experimental results and
discussion are presented in Sections 5 and 6. Finally, Section
7 concludes the paper.


2. RELATED WORKS 


Mitigating the MOOC dropout rate is essential for increasing the
value of MOOCs, so mechanisms that can predict student dropout
are becoming increasingly important.


Some exploratory analysis suggests that student behavior in the
discussion forum helps predict attrition. Yang et al. [6]
present a foundation for research investigating the social
factors that affect dropout along the way during participation
in MOOCs. To operationalize these factors, they define metrics
related to posting behavior (thread starter, post length,
content length) and social positioning (posts & replies) within
the resulting reply network. Similarly, Ramesh et al. [8]
explore other aspects of the discussion forum, such as post
viewing and sentiment. This perspective provides a potentially
valuable source of insight for designing MOOCs that are more
conducive to social engagement, which promotes commitment and
therefore lowers attrition. Its application is limited, however,
because it mainly addresses learners who drop out for lack of
interpersonal connection online.


Many researchers aim to model learning behavior over a period of
weeks. They extract significant features by parsing the
clickstream file, in which each line represents a web request.
These features include lecture interaction features, forum
interaction features, and assignment features [1-4,11], which
capture the activity level of learners.


In terms of applied models, Kloft et al. [5] explore support
vector machines (SVM) to predict the state of learners in the
later phases of a course. Balakrishnan et al. [2] quantize the
feature space into a discrete number of observable states that
are integral to a Discrete Single Stream HMM. Fei et al. [9]
propose a recurrent neural network (RNN) model with long
short-term memory (LSTM) cells.


3. DATA SET AND FEATURE SET
3.1 Dataset


The learner activity log data came from a public data mining
competition, KDD Cup 2015. It includes 79,186 learners, each of
whom enrolled in at least one of the 39 courses. In total, the
clickstream data includes 8,157,277 log records, and the longest
enrollment lifetime is 6 weeks. Most of the data is user
activity log data and course structure data.


3.2 Feature set

As stated above, our goal is to estimate the probability that a
student stops engaging with a course for the next two weeks,
given her/his learning activities up to the current time step.


The dropout probabilities are closely related to learners'
engagement with courses, which is mainly characterized by forum,
lecture, and assessment features. To express the time-varying
behavior of learners, we extract 17 typical features for each
week t and each learner i, denoted as a vector
$x_t^{(i)} \in \mathbb{R}^{17}$, as presented in Table 1. Note
that the features we selected are vital but highly correlated
with each other; we will introduce a model to remove this
redundancy.


f1-f3     Number of posts in discussions, videos watched,
          problems attempted in week t, respectively
f4-f6     Total number of discussions made, videos watched,
          problems attempted by week t
f7-f9     Average number of discussions, videos, problems
          attempted per week by week t
f10-f12   Average number of discussions, videos, problems
          attempted per session in week t
f13       Sum of number of other activities (navigate, access,
          page close, wiki) in week t
f14       Total number of activities in week t
f15       Total number of active days in week t
f16       Total time consumption in week t
f17       Total number of sessions in week t


Table 1: List of features derived for week t


3.2.1 Interactions with forums 

A MOOC forum provides a platform for communication between
learners and lecturers. The more actively learners interact with
their peers, the more they feel that they belong in the course
and the more likely they are to complete the learning tasks.
Some features, such as viewing a post, receiving a reply,
following a thread, and up-voting, are strong indicators of
engagement and sense of community [6,7].


3.2.2 Interactions with lectures 

Because lecture videos are the most important learning resource
for participants, video-playing behavior should be investigated,
as other researchers have done. Among these works, Kim et al.
[1] explored click actions made while watching videos. These
behaviors can be classified into six types: skipping, zooming,
playing, replaying, pausing, and quitting.


3.2.3 Interactions with assignments 

It is reasonable to hypothesize that an active and engaged
student would check their assignments a few times every week,
because material is released and due on a weekly basis. By
monitoring this week by week, we can roughly estimate how
up-to-date a student is with a course. It is acknowledged that
if a learner falls too far behind, it is hard to catch up and
the determination to complete is lost [2].

Furthermore, we observe from the user activity log data whether
learners are active in sessions, as the data contain multiple
records in quick succession. We set the threshold separating two
sessions at 45 minutes: if the gap between two consecutive
operations by a learner exceeds 45 minutes, we assume that the
learner quit and logged in again.
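
The 45-minute rule can be sketched as follows. This is a minimal illustration: the function name is hypothetical, and representing events as timestamps in seconds is an assumption, not the format of the competition logs.

```python
SESSION_GAP_SECONDS = 45 * 60  # threshold stated in the text


def count_sessions(timestamps, gap=SESSION_GAP_SECONDS):
    """Count sessions in a learner's event times (given in seconds)."""
    if not timestamps:
        return 0
    ordered = sorted(timestamps)
    sessions = 1
    for previous, current in zip(ordered, ordered[1:]):
        # A gap longer than the threshold starts a new session.
        if current - previous > gap:
            sessions += 1
    return sessions


# Three quick events, a pause of 46 minutes, then two more events.
events = [0, 600, 1200, 1200 + 46 * 60, 1200 + 47 * 60]
print(count_sessions(events))
```

A count of this kind is what the per-session averages and the weekly session total in Table 1 would be built on.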


Consequently, for the current week t, we obtain a sequence of
feature vectors $(x_1^{(i)}, x_2^{(i)}, \ldots, x_t^{(i)})$ for
each learner i across t weeks and the corresponding sequence of
dropout labels $(y_1^{(i)}, y_2^{(i)}, \ldots, y_t^{(i)})$.


If there are activities associated with student i in the coming
week, the dropout label in week t is assigned as
$y_t^{(i)} = 0$; otherwise, $y_t^{(i)} = 1$. Notably, all
features should be centered and normalized to unit standard
deviation (mean of 0 and variance of 1).
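
The labeling rule and the normalization step can be sketched as follows. The function names and the list-based data layout are assumptions made for illustration; the last observed week receives no label here because the following week is not available.

```python
import math


def dropout_labels(weekly_activity_counts):
    """Label week t with 0 if the learner is active in week t+1, else 1."""
    return [0 if following > 0 else 1
            for following in weekly_activity_counts[1:]]


def standardize(column):
    """Center a feature column and scale it to unit standard deviation."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(variance)
    if std == 0:
        return [0.0 for _ in column]  # constant feature: nothing to scale
    return [(v - mean) / std for v in column]


# Hypothetical activity counts for one learner over five weeks.
print(dropout_labels([5, 3, 0, 0, 2]))
print(standardize([1.0, 2.0, 3.0]))
```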


4. OUR MODEL 


4.1 Feature Extractor: Stacked Sparse Autoencoder (SSAE)


Having extracted weekly features from the user activity log
records, we employ a model named Stacked Sparse Autoencoder
(SSAE) to discover a high-level representation of the input
features and the correlations among them. In this part, we aim
to produce a better feature representation that can reveal
learners' patterns of behavior.


Autoencoder neural networks are a series of models that
re-represent features by encoding them into a high-level
representation using one set of parameters and decoding that
representation back to the original values using another set of
parameters. A sparse autoencoder neural network consists of an
input layer, a hidden layer, and an output layer, where the
hidden layer is larger than the input layer. The network
structure is presented in Figure 1.


Figure 1: Network Structure


Formally, let the input-layer vector be the features of learner
i extracted from the weekly history of learning behavior. We
train the network to minimize the divergence between the input
layer and the output layer, i.e.,
$h_{W,b}(x_t^{(i)}) = x_t^{(i)}$. After the model converges,
which means it achieves a minimal difference between input
features and output values, the hidden layer learns a new
representation of the input. The number and dimensions of the
hidden layers control the complexity of the network and must be
tuned to determine their optimal values. Notably, the new
features that the hidden layer represents serve as the input of
a classifier. To train this autoencoder network, we apply the
back-propagation algorithm to minimize the overall cost function
as follows:


$$J_{sparse}(W, b) = J(W, b) + \beta \sum W \cdot W$$


where $J(W, b)$ consists of two parts: an average sum-of-squares
error and a penalty term that helps prevent overfitting.
$\sum W \cdot W$ denotes the sum of every element of the
element-wise product of W with itself. $\beta$ represents the
weight of the sparsity penalty term.


Here we do not introduce the details; computational details 
can be found in [10]. 
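
The cost described above can be sketched as follows. This follows only what the text states, namely a reconstruction error plus a weighted sum of squared weights. The function names, the factor of 0.5 in the error term, and the omission of any further penalty inside $J(W, b)$ are assumptions; the full definition is in [10].

```python
def reconstruction_error(inputs, outputs):
    """Average over examples of half the squared reconstruction error."""
    total = 0.0
    for x, x_hat in zip(inputs, outputs):
        total += 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total / len(inputs)


def squared_weight_sum(matrix):
    """Sum of every element of the element-wise product W * W."""
    return sum(w * w for row in matrix for w in row)


def sparse_cost(inputs, outputs, weight_matrices, beta):
    """Reconstruction error plus beta times the summed squared weights."""
    penalty = sum(squared_weight_sum(W) for W in weight_matrices)
    return reconstruction_error(inputs, outputs) + beta * penalty


# Hypothetical inputs, reconstructions, and one weight matrix.
inputs = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.5, 0.0], [0.0, 1.0]]
weights = [[[1.0, 2.0], [3.0, 4.0]]]
print(sparse_cost(inputs, outputs, weights, beta=0.01))
```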


To generate more general (higher-level) features, we use a
method called stacking to enrich the capacity of our model. We
first train one autoencoder and then use its features as the
input and output of another autoencoder. We thus obtain a more
abstract representation of the original features, which can be
more suitable for describing learners' inner condition.


Compared with other methods such as PCA, the neural-network-based
SSAE is more powerful. In most cases, the relations between meta
features are complex and cannot be represented by simple
functions such as linear functions, so traditional methods
cannot separate them well. Neural networks, however, can
approximate a very broad class of functions as long as they are
given enough capacity (e.g., enough layers or enough cells),
which enables them to project meta features into an independent
orthogonal linear space.


4.2 Sequenced feature combiner: RNN 

An RNN (Recurrent Neural Network) is a class of artificial
neural networks for sequence data. It takes sequenced data step
by step and, at every step, generates an output that depends on
all previous inputs. A basic RNN with one hidden layer is shown
in Figure 2.


Figure 2: Basic RNN Structure (input layer, hidden layer, output
layer)


Formally, RNN is a function , where h is the hidden status 
(memory) of hidden units, and D is the size of input vec- 
tor and L is the size of the output vector. The memory h 
changes every time while giving new inputs at each step. 


The input vector of the RNN is the high-level representation
generated by the SSAE introduced in Section 4.1. We aim to
obtain a good feature representation that captures a learner's
entire event history within a fixed-length vector, in order to
make predictions and to classify dropout learners by their
reasons for dropping out.


A simple RNN has parameters $(W_h, U_h, W_y, b_h, b_y)$, where
$W_h$ controls what to absorb into memory from the input
features, $U_h$ determines what to remember and what to forget
from the previous memory state, $W_y$ sets the output value, and
$b_h$ and $b_y$ are biases that apply a global offset to the
hidden state and the output value, respectively.


The computation of this kind of RNN is shown below:


$$h_t = \sigma_h(W_h x_t + U_h h_{t-1} + b_h)$$
$$y_t = \sigma_y(W_y h_t + b_y)$$


where $x_t$ and $y_t$ represent the input features and output
vector at time t, and $h_t$ is the memory held by the RNN. Here,
$\sigma_h$ and $\sigma_y$ can be the same or different
activation functions. Typical choices are the sigmoid function
and the tanh function. We choose tanh for both formulas, as it
typically yields faster training (and sometimes better local
minima). The tanh operation is calculated as follows:


$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
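
The recurrence can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names are hypothetical, matrices are lists of rows, and starting the memory at zero is an assumption.

```python
import math


def tanh_explicit(x):
    """tanh written out from its exponential definition."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_forward(xs, W_h, U_h, W_y, b_h, b_y):
    """Run a simple tanh RNN over a sequence; return (outputs, final h)."""
    h = [0.0] * len(b_h)  # assumption: memory starts at zero
    outputs = []
    for x in xs:
        pre_h = [a + b + c for a, b, c
                 in zip(matvec(W_h, x), matvec(U_h, h), b_h)]
        h = [tanh_explicit(v) for v in pre_h]
        pre_y = [a + b for a, b in zip(matvec(W_y, h), b_y)]
        outputs.append([tanh_explicit(v) for v in pre_y])
    return outputs, h


# One-dimensional toy example with two time steps.
outputs, h = rnn_forward([[1.0], [0.0]],
                         [[1.0]], [[0.5]], [[1.0]], [0.0], [0.0])
print(outputs, h)
```

The second step receives a zero input, so its hidden state depends only on the memory carried over from the first step, which shows how earlier inputs influence later outputs.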


We do not apply the LSTM used by other researchers [8], for two
reasons. An LSTM is a special kind of RNN that has the ability
to forget: it can determine what to remember and when to forget
its memory as new inputs arrive, whereas a simple RNN can only
remember all of its inputs. We believe that forgetting is
unnecessary for a sequence no longer than six steps. In
addition, a simple RNN requires less computation, which makes it
more suitable for a large-scale online service.


4.3 Classifier 
4.3.1 Support Vector Machine (SVM)


Prior work mentioned in the related work inspires us to employ
an SVM to predict the learning state in the next two consecutive
weeks. The SVM computes an affine-linear prediction function
based on maximizing the margin between positive and negative
examples:


$$(w, b) := \arg\min_{w, b} \frac{1}{2} \|w\|^2 + C \sum_{i} \max\left(0, 1 - y_i(\langle w, x_i \rangle + b)\right)$$
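
The quantity minimized by a soft-margin SVM of this kind can be evaluated as in the sketch below. This is an illustration only: the function name is hypothetical, labels are assumed to be coded as -1 and +1, and the trade-off constant C is the usual soft-margin parameter, whose value is not given in the text.

```python
def svm_objective(w, b, X, y, C=1.0):
    """Half the squared norm of w plus C times the summed hinge loss."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        # Points on the correct side with margin >= 1 contribute nothing.
        hinge += max(0.0, 1.0 - yi * score)
    return margin_term + C * hinge


# Hypothetical two-dimensional examples with labels in {-1, +1}.
X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
y = [1, 1, -1]
print(svm_objective([1.0, 0.0], 0.0, X, y))
```

Only the second example lies inside the margin, so it alone adds to the hinge term.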


After extracting features, we make predictions with the SVM and
compare them with the results from Softmax. Because the numbers
of dropout and non-dropout users differ markedly, we use random
sampling to reduce the gap between the sizes of the two groups.
With this done, the resulting model does not overfit to either
class.


With the learning features of the current week obtained in
Section 3.2 (Feature set) as input, we apply the SVM to predict
whether a learner will drop out at the end of this week. Three
kernel functions (linear, rbf, and mlp) are tried, and the
prediction accuracy is estimated via 5-fold cross-validation.
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4.3.2 Softmax Regression 

In the softmax regression setting, we are interested in
multi-class classification (as opposed to only binary
classification). Learners are classified into three cases, which
can be represented as {(0,0), (0,1), (1,1)}, where 1 means
dropout, the first number indicates whether the learner drops
out after one week, and the second indicates the result after
two weeks. The label can therefore take on 3 different values,
so the predicted outcome for the i-th learner is
$y^{(i)} \in \{1, 2, 3\}$.


We aim to estimate the probability of the class label taking on
each of the 3 possible values for each learner. Thus, our
hypothesis outputs a 3-dimensional vector (whose elements sum to
1) giving our 3 estimated probabilities. Concretely, our
hypothesis takes the form:

$$h_\theta(x^{(i)}) = \begin{bmatrix} P(y^{(i)} = 1 \mid x^{(i)}; \theta) \\ P(y^{(i)} = 2 \mid x^{(i)}; \theta) \\ P(y^{(i)} = 3 \mid x^{(i)}; \theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^{3} e^{\theta_j^T x^{(i)}}} \begin{bmatrix} e^{\theta_1^T x^{(i)}} \\ e^{\theta_2^T x^{(i)}} \\ e^{\theta_3^T x^{(i)}} \end{bmatrix}$$


where $\theta_1, \theta_2, \theta_3 \in \mathbb{R}^n$ are the
model parameters of softmax, and the factor
$\frac{1}{\sum_{j=1}^{3} e^{\theta_j^T x^{(i)}}}$ normalizes the
probability distribution so that the probabilities sum to 1.
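
A softmax hypothesis over three classes can be sketched as follows. The function name and the parameter values are hypothetical, and the assignment of class indices 1, 2, 3 to the cases (0,0), (0,1), (1,1) is an assumption, since the text does not state the order. Subtracting the maximum score is a standard numerical safeguard that leaves the result unchanged.

```python
import math

# Assumed mapping from class index to (dropout after 1 week, after 2 weeks).
CLASS_MEANING = {1: (0, 0), 2: (0, 1), 3: (1, 1)}


def softmax_hypothesis(thetas, x):
    """Return the class probabilities for feature vector x."""
    scores = [sum(t * v for t, v in zip(theta, x)) for theta in thetas]
    top = max(scores)  # shift scores for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical parameters for a 2-dimensional feature vector.
thetas = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
probs = softmax_hypothesis(thetas, [2.0, 0.0])
predicted_class = probs.index(max(probs)) + 1
print(probs, CLASS_MEANING[predicted_class])
```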


5. EXPERIMENTS 

5.1 AUC Score 

We can observe from the KDD Cup label set that 79% of the labels
are positive and 21% are negative. Because of this class
imbalance, accuracy is not a good metric. Instead, the area
under the receiver operating characteristic curve (ROC AUC) is
the main metric we use for parameter tuning and model selection.
AUC measures how likely a classifier is to rank a randomly
chosen positive sample above a randomly chosen negative one.
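
The pairwise interpretation of AUC can be computed directly, as in the sketch below. This is an illustration, not the evaluation code used in the experiments; the function name and the example scores are hypothetical, and ties are counted as half a win by the usual convention.

```python
def roc_auc(labels, scores):
    """Fraction of positive-negative pairs that are ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = dropout) and predicted scores.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

Because the measure depends only on how positives rank relative to negatives, it is not inflated by always predicting the majority class, unlike accuracy.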


Model          Week 1   Week 2   Week 3   Week 4   Week 5
SSAE+Softmax   0.???    0.895    0.887    0.803    0.754
SSAE+SVM       0.894    0.867    0.849    0.784    0.729


Table 2: AUC comparison of SSAE+Softmax, SSAE+SVM, SVM


Table 2 presents the average AUC scores across weeks for the two
classifiers (Softmax, SVM). The results indicate that the models
that employ the SSAE to discover correlations among the initial
features extracted from the dataset, namely SSAE+Softmax and
SSAE+SVM, are more competitive. They consistently achieve higher
AUC scores than the baseline SVM model. For instance, for the
first week, the AUC score of SSAE+SVM is 0.894, a 7.58%
improvement relative to that of SVM.


Specifically, our proposed model SSAE+Softmax outperforms the
other models across weeks. This observation implies that Softmax
performs consistently better than SVM at classifying a learner's
previous states and predicting whether he/she will drop out.


More notably, the AUC score decreases as the course progresses.
We infer that dropout behavior may involve uncertainties that
our model cannot discover from weekly history records alone.
External forces such as a lack of free time may result in more
complex patterns of behavior. For instance, a learner may leave
suddenly at week 4, even though all statistical features of the
previous three weeks strongly indicate that he/she is not
inclined to drop out.


5.2 Confusion Matrix 

In this two-class classification problem, the confusion matrix
has 4 entries: true positive (TP), false negative (FN), false
positive (FP), and true negative (TN). From these we compute:


$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \text{TruePositiveRate} = \frac{TP}{TP + FN}$$

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
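
These three metrics can be computed as in the following sketch. The function names and the example counts are hypothetical; the second call simply recomputes F1 from the precision and recall reported for SSAE+SVM in Table 3.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1_score(precision, recall)


# Hypothetical counts: 8 true positives, 2 false positives, 1 false negative.
print(precision_recall_f1(8, 2, 1))

# F1 recomputed from the SSAE+SVM precision and recall in Table 3.
print(round(f1_score(0.873, 0.907), 3))
```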


The comparison of the metrics mentioned above is presented in
Table 3. The SSAE+Softmax model outperforms the other models
consistently, indicating that it handles the prediction task
well. These results across weeks lay a foundation for
identifying patterns of behavior and suggesting interventions
for inactive learners.


Model          Precision   Recall   F1
SSAE+Softmax   0.891       0.942    0.916
SSAE+SVM       0.873       0.907    0.890
SVM            0.854       0.887    0.870


Table 3: Performance comparison of SSAE+Softmax, SSAE+SVM, SVM


6. DISCUSSION 


Experimental results on a real-world dataset demonstrate that
dropout probability is consistently predictable across weeks for
different students. The next step is to apply the newly proposed
model (SSAE+Softmax) on MOOC platforms to mitigate the dropout
rate by suggesting interventions, such as sending reminder
emails, with the goal of helping at-risk learners retain their
interest.


Email is a very cheap medium for reaching learners and creating
awareness quickly. Our proposed model will help determine to
whom an email should be sent and when to send it. Identifying
at-risk learners precisely avoids bombarding active learners
with unnecessary emails while informing at-risk learners in time
to bring back as many of them as possible.


Here we only present a preliminary framework for sending
reminder emails. Specifically, at the end of week t, we first
extract weekly feature vectors for t weeks and employ
SSAE+Softmax to predict the future states $y_t$ and $y_{t+1}$.
Then, we determine a candidate set of potential at-risk learners
who satisfy $y_t = 1$ and $y_{t+1} = 1$, where $y_t$ denotes the
status of the next week. Finally, we observe the behavior of
every selected learner in the coming week t+1. If the "at risk"
state is confirmed ($y_t = 1$), the platform sends reminder
emails immediately at the end of week t+1.
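
The framework can be sketched as follows. This is a hypothetical illustration of the selection logic only: the function names, the dictionary of predictions, and the set of learners observed to be active are assumptions, and no email is actually sent.

```python
def candidate_learners(predictions):
    """Learners predicted to be inactive in both of the next two weeks."""
    return {learner for learner, (y_t, y_t1) in predictions.items()
            if y_t == 1 and y_t1 == 1}


def learners_to_email(predictions, active_in_following_week):
    """Candidates whose at-risk state is confirmed by observed inactivity."""
    return sorted(learner for learner in candidate_learners(predictions)
                  if learner not in active_in_following_week)


# Hypothetical predictions (y_t, y_{t+1}) made at the end of week t.
predictions = {"a": (1, 1), "b": (0, 1), "c": (1, 1), "d": (0, 0)}
# Learners observed to be active during week t+1.
active = {"c", "d"}
print(learners_to_email(predictions, active))
```

Waiting one week to confirm the prediction trades a later reminder for fewer emails sent to learners who were never going to leave.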




Although the experiments presented in this paper are limited to
the KDD Cup data, we plan to augment our model and evaluate the
effectiveness of sending reminder emails on a real MOOC platform
established by our university. Future work applying this model
is currently underway, and the email-sending scheme will be
improved step by step.


With the features described in Section 3, we have completed the
analysis of dropout prediction using the model described in
Section 4. Given the model's performance, we are persuaded that
it can predict and diagnose dropout. In the following part, we
analyze how each feature influences the final dropout
probability by conducting a sensitivity analysis, and we try to
cluster dropout learners with the k-means algorithm to identify
their patterns of behavior.


To make the data comparable, we separate user events by course
and take the course with the most students (which is also the
one with the most students who completed it) as our example.
First, we try to find the standard behavior of learners who
complete the course with good quality. We simply take the event
logs of all non-dropout students, average each of the features,
and regard the result as a typical requirement for finishing
this course. Next, we change each of the features step by step
and make predictions using our neural networks with fixed
parameters, obtaining three outputs that represent the
probabilities of dropping out in one week, dropping out in two
weeks, or not dropping out. We use a score ranging from 0 to 1
to evaluate the quality of these features.


Algorithm 1 Univariate analysis of feature_i


procedure UNIVARIATEANALYSIS(model, input_features)
    standard ← average(input_features)
    for rate ∈ (0.5 ... 1.5) do
        features ← standard
        features_i ← rate × features_i
        EvaluateDropoutRate(model, features)
    end for
end procedure


In Algorithm 1, "input_features" are the features of those who
complete the courses, and "model" is the model introduced above,
which uses SSAE, RNN, and Softmax to predict a dropout rate,
regarded as a score ranging from 0 to 1.
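
Algorithm 1 can be sketched in Python as follows. The function names are hypothetical, and `toy_model` is a made-up stand-in for the trained SSAE, RNN, and Softmax pipeline, used only so that the example runs on its own.

```python
import math


def univariate_analysis(model, input_features, index, rates):
    """Scale one averaged feature by each rate and record the model score."""
    count = len(input_features)
    width = len(input_features[0])
    # "standard" is the per-feature average over the completing learners.
    standard = [sum(row[j] for row in input_features) / count
                for j in range(width)]
    results = []
    for rate in rates:
        features = list(standard)
        features[index] = rate * standard[index]
        results.append((rate, model(features)))
    return results


def toy_model(features):
    """Hypothetical dropout score that falls as feature 0 grows."""
    return 1.0 / (1.0 + math.exp(features[0] - 3.0))


completers = [[2.0, 10.0], [4.0, 20.0]]  # hypothetical feature rows
for rate, score in univariate_analysis(toy_model, completers, 0,
                                       [0.5, 1.0, 1.5]):
    print(rate, round(score, 3))
```

Only one feature changes per run while the others stay at their averages, so any change in the score can be attributed to that feature.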


Notably, these features representing learning behaviors are 
classified into two categories: those related to course materi- 
als directly (e.g., watching videos, browsing wiki) and those 
not (e.g., navigate, page_close). We test some features to 
show how they influence a learner’s dropout probability, as 
presented in Figure 3. 


When the number of video views is reduced to 60 percent of the
standard value, the dropout probability increases sharply from
0.12 to 0.875. Under the same scaling, the dropout probability
for the page_close feature increases from 0.52 to 0.774, a
smaller change. This implies that metrics closely related to
course materials matter more than the others: compared with
indirect activities, the amount of direct engagement with course
materials is highly relevant to the probability of completing
the course.


Figure 3: Sensitivity Analyses (panels: watch video, page close;
vertical axis: dropout rate)


We then cluster dropout learners with the k-means clustering 
algorithm, setting k = 10. The features extracted in Section 3 
are effective indicators of which pattern of behavior a 
low-performing learner belongs to. We map each feature vector to 
one of the 10 clusters. Two clusters contain noticeably more 
low-performing learners than the others. 
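
As a rough illustration of this clustering step, the sketch below runs a plain Lloyd-style k-means with k = 10 in pure Python. The feature vectors are randomly generated placeholders, and the initialization, iteration count, and seed are assumptions, since the source does not state which implementation or settings were used.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the given feature vector."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))

def kmeans(points, k, iterations=50, seed=0):
    """Basic Lloyd iteration: assign points to centroids, then recompute the means."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:   # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Placeholder data: 200 dropout learners with 4 features each (not real data).
data_rng = random.Random(1)
dropout_vectors = [[data_rng.random() for _ in range(4)] for _ in range(200)]

centroids, labels = kmeans(dropout_vectors, k=10)
cluster_sizes = [labels.count(c) for c in range(10)]
largest_two = sorted(range(10), key=lambda c: cluster_sizes[c], reverse=True)[:2]
new_learner_cluster = nearest([0.5, 0.5, 0.5, 0.5], centroids)
```

`largest_two` picks out the two most populated clusters, and `nearest` maps any new feature vector to one of the 10 clusters, mirroring the mapping step described above.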


Inactive learners belonging to one of the two clusters mentioned 
above perform worse the longer they stay engaged. Monitoring 
their learning behavior in terms of lecture videos, discussions 
and assignments, we find that the numbers decrease significantly 
week by week. It can be inferred that they put less and less 
effort into learning as the course continues, which is a strong 
indicator of failing to keep up with the pace of the course. 


Inactive learners belonging to the other cluster display a 
complex pattern of behavior. For instance, they leave the course 
for one or two weeks and then come back to learn. At the 
beginning, these learners display a high level of perseverance 
and self-discipline. Almost all the statistics show that they 
have regular patterns of studying, which is confirmed by the low 
dropout probability computed by our model. However, they behave 
poorly in the following weeks. Specifically, for some learners, 
the numbers of videos watched, discussion posts made, and 
problems attempted all suddenly drop to 0. After some weeks, 
these learners come back to learn, and all of their learning 
data reach their highest values compared with previous weeks. 
Finally, they do not take the exams and drop out. It may be 
inferred that such learners are “trying but not succeeding”, due 
to limited time (or perhaps other external forces). 


In the future, to extend our model, we will send a survey to the 
learners predicted to leave the course to find out why they are 
disengaging. We will shed light on the relationships between 
learners’ behavior patterns and the reasons why they quit the 
course. 


7. CONCLUSIONS 


In this paper, we propose different composite models that 
incorporate multiple features to infer behavior for the next 
two weeks based on features extracted from the weekly history of 
learning data. The SSAE+Softmax model consistently achieves a 
higher AUC score than the baseline SVM model. In addition, an 
application of the model that includes an automated email 
reminder system is under construction. 


8. ACKNOWLEDGEMENT 
This work is supported by NSFC under Grant No. 61532001 
and 61370054, and MOE-RCOE under Grant No. 2016ZD201. 


9. REFERENCES 

[1] Kim, J., Guo, P. J., Seaton, D. T., Mitros, P., Gajos, 
K. Z., & Miller, R. C. (2014, March). Understanding 
in-video dropouts and interaction peaks in online lecture 
videos. In Proceedings of the first ACM conference on 
Learning @ scale conference (pp. 31-40). ACM. 

[2] Balakrishnan, G., & Coetzee, D. (2013). Predicting 
student retention in massive open online courses using 
hidden markov models. Electrical Engineering and 
Computer Sciences University of California at Berkeley. 

[3] Halawa, S., Greene, D., & Mitchell, J. (2014). Dropout 
prediction in MOOCs using learner activity features. 
Experiences and best practices in and around MOOCs, 7. 

[4] He, J., Bailey, J., Rubinstein, B. I., & Zhang, R. (2015, 
January). Identifying At-Risk Students in Massive 
Open Online Courses. In AAAI (pp. 1749-1755). 

[5] Kloft, M., Stiehler, F., Zheng, Z., & Pinkwart, N. 
(2014, October). Predicting MOOC dropout over weeks 
using machine learning methods. In Proceedings of the 
EMNLP 2014 Workshop on Analysis of Large Scale 
Social Interaction in MOOCs (pp. 60-65). 

[6] Yang, D., Sinha, T., Adamson, D., & Rosé, C. P. 
(2013, December). Turn on, tune in, drop out: 
Anticipating student dropouts in massive open online 
courses. In Proceedings of the 2013 NIPS Data-driven 
education workshop (Vol. 11, p. 14). 

[7] Rachelle Peterson. 2013. Why Do Students Drop Out of 
MOOCs? Article. (13 November 2013). 
https://www.nas.org/articles/why_do_students_drop_out_of_moocs 


[8] Ramesh, A., Goldwasser, D., Huang, B., Daumé III, 
H., & Getoor, L. (2014, July). Learning latent 
engagement patterns of students in online courses. In 
Proceedings of the Twenty-Eighth AAAI Conference on 
Artificial Intelligence (pp. 1272-1278). AAAI Press. 

[9] Fei, M., & Yeung, D. Y. (2015, November). Temporal 
Models for Predicting Student Dropout in Massive 
Open Online Courses. In 2015 IEEE International 
Conference on Data Mining Workshop (ICDMW) (pp. 
256-263). IEEE. 

[10] Ng, A. (2011). Sparse autoencoder. CS294A Lecture 
notes, 72, 1-19. 

[11] Henrie, C. R., Halverson, L. R., & Graham, C. R. 
(2015). Measuring student engagement in 
technology-mediated learning. Elsevier Science Ltd. 




Characterizing Collaboration in the Pair Program 
Tracing and Debugging Eye-Tracking Experiment: A 
Preliminary Analysis 


Maureen M. Villamor 
Ateneo de Davao University, Quezon City, Philippines 
University of Southeastern Philippines, Davao City, 
Philippines 
maui@usep.edu.ph 


ABSTRACT 


This paper characterizes the extent of collaboration of pairs of 
novice programmers as they traced and debugged fragments of 
code, using cross-recurrence quantification analysis (CRQA). This 
preliminary analysis specifically aimed to compare and 
assess the collaboration of pairs consisting of two individuals who 
may have different or the same level of prior knowledge given a task. 
We performed a CRQA to build cross-recurrence plots from 
eye-tracking data and computed the CRQA metrics, such as 
recurrence rate (RR), determinism (DET), average diagonal length 
(L), longest diagonal length (LMAX), entropy (ENTR), and 
laminarity (LAM), using the CRP toolbox for MATLAB. Results 
showed that low prior knowledge pairs (BL) collaborated better 
than high prior knowledge pairs (BH) and mixed prior 
knowledge (M) pairs, as indicated by their high RR and DET, which 
imply that they had more recurrent fixations and matching scanpaths. 
However, the BL pairs’ high ENTR and LAM could mean that 
they had more difficulty in understanding and 
debugging the programs. All pairs, regardless of category, 
exerted more or less the same level of attunement when asked to 
debug the programs, as evident in their L values. The mixed pairs 
seemed to have struggled with eye coordination the most, as they had 
the most incidences of low LMAX. 


Keywords 


Eye-tracking, Collaboration, Cross-recurrence quantification 


1. INTRODUCTION 


Eye gaze plays an essential role in social interaction processes. In 
Computer-Supported Collaborative Learning (CSCL), eye-tracking 
has been used in previous works to study joint attention 
in collaborative learning situations [9][16]. Two eye-trackers, for 
instance, can be synchronized to study the gaze of two 
persons collaborating to solve a problem and to 
understand how gaze and speech are coupled [11-13]. 


The use of gaze coupling was first proposed in [11] to study 
conversation coordination. That study defined gaze 
coupling as episodes when participants are looking at the same 
target. Its results showed that the coupling of eye gaze between 
collaborating partners may be an indicator of quality interaction 
and better comprehension. In the domain of pair programming, 
Pietinen et al. [10] suggested that gaze closeness could reflect 
tightness of collaboration. Other prior studies [1][11-13] have 
likewise shown that the coupling of eye gaze between collaborating 
partners may be an indicator of quality interaction and better 
comprehension and that joint attention, and more generally 
synchronization between individuals, is essential for effective 
collaboration. 


Cross-recurrence quantification analysis (CRQA), introduced in 
[18], is an extension of Recurrence Quantification Analysis 
(RQA) [7] that is used to quantify how frequently two systems 
exhibit similar patterns of change or movement in time. It takes 
two different trajectories of the same information as input and 
compares all points of the first trajectory with all points of the 
second trajectory, forming a cross-recurrence plot (CRP). The 
CRP permits visualization and quantification of recurrent state 
patterns between two time series. Analysis using CRPs has been 
proposed as a generalized method to unveil the interlocking of 
two interacting people [2]. It has been used to analyze the 
coordination of gaze patterns between individuals and 
to determine how closely two collaborators’ gazes follow 
each other. In the scientific literature, a cross-recurrence gaze plot 
is considered the standard way of representing social eye-tracking 
data [16]. 


CRQA was used in [11], which provided the first quantification of 
gaze coordination in their monologue data to analyze the relation 
between eye movements of the speaker and the listener. The 
analysis revealed that the coupling between speaker and listener 
eye-movements predicted how well the listener understood what 
was said. They extended their findings in their succeeding studies 
[12-13] and results revealed that eye movement coupling found in 
monologue indeed extends to dialogues. 


In the context of pair programming, Jermann et al. [5] used 
synchronized eye-trackers to assess how programmers 
collaboratively worked on a segment of code, and they also 
contrasted a “good” and a “bad” pair using cross-recurrence plots. 
Results showed that high gaze recurrence seems to be typical of a 
“good” pair where the flow of interaction is smooth and where 
partners sustain each other’s understanding. A dual eye-tracking 
study was also conducted that demonstrated the effect of sharing 
selection among collaborators in a remote pair-programming 
scenario [4]. They used gaze cross-recurrence analysis to measure 
the coupling of the programmers’ focus of attention. Their 
findings showed that pairs who used text selection to perform 
collaborative references have high levels of gaze cross-recurrence. 


This paper aimed to use CRQA to characterize collaboration of 
pairs of novice programmers in the act of tracing fragments of 
code and debugging. Specifically, this was a preliminary study 
that attempted to answer the following research question: Using 
CRQA, what characterizes collaboration of pairs consisting of (a) 
both high prior knowledge students, (b) both low prior knowledge 
students, and (c) high- and low-prior knowledge students? 


Although the use of CRQA as an approach to assess collaboration 
between participants in a pair programming eye-tracking 
experiment is not an entirely novel approach, the main 
contribution of this study was the inclusion of the composition of 
the pairs in terms of expertise levels. Previous studies did not 
characterize the pairs based on prior knowledge in programming 
or level of expertise. 


2. METHODS 
2.1 Participants 


The study was conducted in two private universities in the 
Philippines. Students who had taken the college-level 
fundamental programming course were recruited to participate in 
this study. Since the study is still ongoing, only 16 pairs of 
participants had been recruited as of the writing of this paper. 


2.2 Structure of the Study 


A screening questionnaire was distributed to student volunteers 
to determine their eligibility to take part in this study (e.g., no 
cataracts, no implants, etc.), and they were required to undergo an 
eye-tracking calibration test. Participants who passed both 
screenings were given consent letters to fill out and sign. They 
were then asked to take a written program comprehension test (20 
minutes) to determine their level of programming knowledge and 
skills. The actual eye-tracking experiment followed and was 
designed to last at most 60 minutes. Two Gazepoint eye-trackers 
were used to collect the pairs’ eye-tracking data. The 
pairs were shown 12 programs with known bugs and were asked 
to mark the location of the bugs with an oval. There was no need 
to correct the errors. 


A slide sorter program with “Previous”, “Reset”, “Finish” and 
“Next” buttons was created to display the program specifications 
followed by the buggy programs. The participants were free to 
click any of the buttons and to navigate the 
slides. No scrolling was needed. When a participant found a bug, 
he/she clicked on its location and the software then 
drew an oval to mark it. Figure 1 is an excerpt from a specific 
slide in the slide sorter program showing the ovals. 


The pairs were told to work with their partner on the problems 
and to collaborate using a chat program. The participants were seated 
in the same room but were spaced far enough apart to ensure 
that all communication with their partners was via chat only. After 
the actual eye-tracking experiment, the pairs were asked to fill out 
a post-test questionnaire privately to assess how well they knew 
each other, how well they thought they collaborated, and how 
they felt about their partner. This study limits its analysis to the 
results of the program comprehension test and the eye gaze 
data. 


Figure 1. An excerpt from the slide sorter program showing 
the ovals after marking. 


2.3 Constructing a Cross-Recurrence Plot 

To conduct a cross-recurrence analysis, an N × N matrix called a 
cross-recurrence plot is built, which is essentially a representation 
of the time coupling between two time series. The horizontal axis 
represents time for the first collaborator (C1) and the vertical axis 
represents time for the second collaborator (C2). Given two 
fixation sequences of the collaborators, $f_i$ and $g_i$, $i = 1 \ldots N$, we 
define the cross-recurrence as $r_{ij} = 1$ if $d(f_i, g_j) < \rho$, and $0$ 
otherwise [7]. 


Recurrence occurs when two fixations from different sequences 
land within a given radius $\rho$ of each other, where $d$ is some 
distance metric (e.g., Euclidean distance). Cross-recurrence points 
are represented as a black point (pixel) in the plot (see Figure 2). 
For a pixel to be colored, the distance between the fixations of the 
two collaborators has to be lower than a given threshold. If the two 
collaborators uninterruptedly looked at two different spots on the 
screen for the entire interaction, the resulting CRP would be 
completely blank (white space in Figure 2). On the contrary, if the 
two collaborators looked at the same spot on the screen 
continuously, the plot would show only a dark line on the 
diagonal. Points exactly on the diagonal of the plot correspond to 
synchronous recurrence, that is, the collaborators look at the same 
target at exactly the same time. Points above the diagonal 
correspond to fixations of C2 that happen after C1 has fixated the 
element. Points below the diagonal correspond to C2’s gaze 
leading C1’s. Asymmetries above and below the diagonal line 
could therefore be indicative of leading and following behaviors. 
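
The construction described in this section can be sketched in a few lines of Python. The study itself used the CRP toolbox for MATLAB; the function below is only an illustrative re-implementation of the thresholded distance matrix, and the fixation coordinates and radius are invented for the example.

```python
import math
from collections import Counter

def cross_recurrence_plot(f, g, radius):
    """crp[i][j] is 1 when fixation i of C1 and fixation j of C2 are closer than radius."""
    return [[1 if math.dist(fi, gj) < radius else 0 for gj in g] for fi in f]

# Invented (x, y) fixations: C2 looks at the same targets one fixation after C1.
c1 = [(100, 100), (200, 100), (300, 100), (400, 100)]
c2 = [(500, 500), (100, 100), (200, 100), (300, 100)]

crp = cross_recurrence_plot(c1, c2, radius=50)

# Lag of each recurrence point: a positive lag (j > i) means C2 fixated after C1.
lag_counts = Counter(j - i
                     for i, row in enumerate(crp)
                     for j, cell in enumerate(row) if cell)
```

Here every recurrence point has lag 1, so all points fall on one side of the main diagonal, which is the kind of asymmetry the text associates with leading and following behavior.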


2.4 CRQA Metrics 


CRQA defines several measures that can be assessed along the 
diagonal and vertical dimensions. For the diagonal dimension, we 
have: recurrence rate, determinism, average and longest length of 
diagonal structures, entropy, and diagonal recurrence profile. For 
the vertical dimension, we have: laminarity and trapping time. The 
definitions that follow are taken from [7]. 
Figure 2. Example of a Cross-Recurrence Plot 


Cross-Recurrence Rate (RR) represents the “raw” amount of 
similarity between the trajectories of two systems, that is, 
the degree to which they tend to visit similar states. In 
eye-tracking data, this represents the percentage of cross-recurrent 
fixations. The more closely coupled the two systems are, in terms 
of sharing the same paths, the more recurrences will be formed 
along the diagonal lines. Hence, a high density of recurrence 
points in a diagonal results in a high value of RR. 


Determinism (DET) is the proportion of all recurrence points that 
form long diagonal structures. In eye-tracking data, this refers 
to the percentage of identical scanpath 
segments of a given minimal length in the two scanpaths. 


The average diagonal length (L) reports the duration that both 
systems stay attuned. High coincidences of both systems increase 
the length of these diagonals. High values of DET and L represent 
a long time span of the occurrence of similar dynamics in both 
trajectories. 


The longest diagonal length (LMAX) on a recurrence plot denotes 
the longest uninterrupted period of time that both systems are in 
concurrence, which can be seen as an indicator of stability of the 
coordination. 


Entropy (ENTR) measures the complexity of the attunement 
between systems. In eye-tracking, this represents the complexity 
of the relation between the two collaborators’ scanpaths. 
ENTR is low if the diagonal lines tend to all have the same length, 
signifying that the attunement is regular; otherwise, ENTR is high, 
signifying that the attunement is complex. 


Using the diagonal recurrence profile (DiagProfile) offers the 
possibility of observing the direction of the coordination, that is, 
if there is an asymmetry with one interlocutor leading the other. 


Vertical structures in a CRP quantify the tendency of the 
trajectories to stay in the same region. The laminarity (LAM) of 
the interaction refers to the percentage of recurrence points 
forming vertical lines, whereas trapping time (TT) represents the 
average time two trajectories stay in the same region. 
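
To make these definitions concrete, the following Python sketch computes the metrics from a binary cross-recurrence matrix using the usual line-based formulas. It is not the CRP toolbox implementation: the minimum line length of 2, the treatment of columns as vertical lines, and the small example matrix are assumptions made for illustration.

```python
import math
from collections import Counter

def run_lengths(cells):
    """Lengths of maximal runs of 1s in a sequence of 0/1 values."""
    runs, current = [], 0
    for value in cells:
        if value:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def diagonal_runs(crp):
    """Run lengths along every diagonal parallel to the main diagonal."""
    n_rows, n_cols = len(crp), len(crp[0])
    runs = []
    for offset in range(-(n_rows - 1), n_cols):
        diagonal = [crp[i][i + offset] for i in range(n_rows)
                    if 0 <= i + offset < n_cols]
        runs.extend(run_lengths(diagonal))
    return runs

def vertical_runs(crp):
    """Run lengths within each column (assumed to be the plot's vertical lines)."""
    runs = []
    for column in zip(*crp):
        runs.extend(run_lengths(column))
    return runs

def crqa_metrics(crp, min_line=2):
    total = sum(map(sum, crp))
    diag = [n for n in diagonal_runs(crp) if n >= min_line]
    vert = [n for n in vertical_runs(crp) if n >= min_line]
    counts = Counter(diag)
    entr = -sum((c / len(diag)) * math.log(c / len(diag))
                for c in counts.values()) if diag else 0.0
    return {
        "RR": total / (len(crp) * len(crp[0])),          # density of recurrence points
        "DET": sum(diag) / total if total else 0.0,      # share of points on diagonal lines
        "L": sum(diag) / len(diag) if diag else 0.0,     # average diagonal line length
        "LMAX": max(diag, default=0),                    # longest diagonal line
        "ENTR": entr,                                    # entropy of diagonal line lengths
        "LAM": sum(vert) / total if total else 0.0,      # share of points on vertical lines
        "TT": sum(vert) / len(vert) if vert else 0.0,    # average vertical line length
    }

example = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
metrics = crqa_metrics(example)
```

In this example the main diagonal forms a single line of length 4, so four of the five recurrence points count toward DET, and the only vertical line (length 2) drives LAM and TT.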


2.5 Data Preparation and Measures 

Results of the written program comprehension test, post-test and 
the number of bugs identified were recorded. The program 
comprehension results were used to categorize the students as 
having high or low prior knowledge. A student was considered to 
have high prior knowledge if his/her program comprehension 
score was equal to or greater than the median score. Otherwise, 
the student has low prior knowledge. 
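
The median split and the resulting pair categories (both high, both low, or mixed prior knowledge) can be sketched as follows. The student identifiers, scores, and pairings are hypothetical; only the rule (score at or above the median counts as high) comes from the text.

```python
from statistics import median

def knowledge_level(score, cutoff):
    """High prior knowledge if the score is at or above the median, else low."""
    return "high" if score >= cutoff else "low"

def pair_category(level_a, level_b):
    """BH: both high, BL: both low, M: mixed."""
    if level_a == level_b == "high":
        return "BH"
    if level_a == level_b == "low":
        return "BL"
    return "M"

# Hypothetical program comprehension scores and pairings.
scores = {"s1": 12, "s2": 7, "s3": 9, "s4": 15, "s5": 8, "s6": 4}
pairs = [("s1", "s4"), ("s2", "s6"), ("s3", "s5")]

cutoff = median(scores.values())
levels = {student: knowledge_level(score, cutoff) for student, score in scores.items()}
categories = [pair_category(levels[a], levels[b]) for a, b in pairs]
```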


The fixation data was cleaned first by removing fixations shorter than 
100 milliseconds [8]. The fixations on each slide that 
contained an actual program were segregated and saved in 
separate files. Hence, each participant has at most 12 fixation 
files. Fixation alignment was performed when the number of 
fixations per program file was uneven. Fixation files with 
fewer than 20 fixations were discarded because they usually produced 
NaN values when the CRQA was performed using the CRP 
toolbox for MATLAB [7]. 
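
A minimal sketch of the two filtering rules, under the assumption that each fixation is stored as a record with a duration field. The record layout, field names, and sample data are hypothetical, and the alignment step is not shown because the text does not describe how it was done.

```python
MIN_FIXATION_MS = 100      # fixations shorter than this are removed
MIN_SEQUENCE_LENGTH = 20   # sequences shorter than this are discarded

def clean_fixations(fixations):
    """Drop fixations shorter than 100 ms."""
    return [f for f in fixations if f["duration_ms"] >= MIN_FIXATION_MS]

def is_usable(fixations):
    """Keep only sequences with at least 20 fixations after cleaning."""
    return len(fixations) >= MIN_SEQUENCE_LENGTH

# Hypothetical per-slide fixation records.
short_slide = [{"x": 10 * i, "y": 50, "duration_ms": 150 if i % 2 == 0 else 80}
               for i in range(25)]
long_slide = [{"x": 10 * i, "y": 50, "duration_ms": 120} for i in range(30)]

cleaned_short = clean_fixations(short_slide)
cleaned_long = clean_fixations(long_slide)
```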


Given 16 pairs and 12 programs, there should have been 16 × 12 = 
192 cases, but only 179 cases were available for the analysis since some 
pairs did not finish all 12 programs and some fixation sequences 
were discarded. A cross-recurrence plot was then constructed for 
each pair for every program, and the cross-recurrence analysis was 
performed to get the RR, DET, LMAX, L, ENTR, and LAM. 


The challenge of using CRQA is finding optimal parameters for 
delay, embed, and radius [7]. An optimal delay can be identified 
when mutual information drops and starts to level off. The 
embedding dimension can be determined using false nearest 
neighbors and checking when there is no information gain in 
adding more dimensions. For this experimental data, however, no 
further embedding was done [3]. With an embedding dimension 
of one, delay was also set equal to one since no points were time 
delayed [17]. For this experimental data, the radius, which is the 
threshold that determines if two fixation points are recurrent, was 
set to 5% of the maximal phase space diameter [15] to avoid 
subjective biases when looking at recurrent patterns. 
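
The radius setting can be illustrated with a short Python sketch. It assumes that the maximal phase space diameter is the largest Euclidean distance between any two points of the two fixation sequences taken together; the CRP toolbox may compute it differently, and the coordinates are invented.

```python
import math
from itertools import combinations

def max_phase_space_diameter(f, g):
    """Largest pairwise distance among all points of both sequences (assumed definition)."""
    points = list(f) + list(g)
    return max(math.dist(a, b) for a, b in combinations(points, 2))

def recurrence_radius(f, g, fraction=0.05):
    """Radius as a fraction (5% in the text) of the maximal phase space diameter."""
    return fraction * max_phase_space_diameter(f, g)

# Invented fixation coordinates in pixels.
c1 = [(0, 0), (300, 400)]
c2 = [(0, 0), (600, 800)]
radius = recurrence_radius(c1, c2)
```

With these points the largest distance is 1000 pixels, so two fixations would count as recurrent when they fall within about 50 pixels of each other.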


3. RESULTS AND DISCUSSION 

Of the 16 pairs, there were three (3) both high prior knowledge 
pairs, five (5) both low prior knowledge pairs, and eight (8) mixed 
prior knowledge pairs. The remainder of the text will refer to 
these categories as BH, BL, and M respectively. The CRQA 
metrics per program according to these relationships were 
averaged separately to get the aggregated CRQA metrics. 


The aggregated results were examined to find differences among 
the categories, which entailed looking at incidences of high and 
low values of the CRQA metrics. A value was considered high if 
it was equal to or greater than the mean plus one standard 
deviation and low if it was equal to or less than the mean minus 
one standard deviation. Table 1 shows the descriptive values of all 
aggregated CRQA metrics per program. No further statistical 
tests were performed since there were too few pairs to 
consider and the analysis was intended only for hypothesis generation. 
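
The high/low rule above can be written as a small Python sketch. The values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

def label_values(values):
    """Label each value high, low, or average using mean ± one standard deviation."""
    m, s = mean(values), stdev(values)   # sample SD assumed
    labels = []
    for v in values:
        if v >= m + s:
            labels.append("high")
        elif v <= m - s:
            labels.append("low")
        else:
            labels.append("average")
    return labels

# Hypothetical aggregated RR values for six programs.
rr_values = [0.06, 0.10, 0.12, 0.13, 0.14, 0.27]
rr_labels = label_values(rr_values)
```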


Findings showed that the BH pairs only had incidences of low to 
average RR values and the BL pairs only had incidences of average to high 
RR values. The M pairs had a mix of high, low, and average RR values. 
See Table 1 for high and low RR. Figure 3 shows the boxplots of 
RR in these categories. 


Table 1. Descriptive values of the CRQA metric per program 


Metric   Mean   SD     Min    Max    Low    High 
RR       0.13   0.05   0.06   0.27   0.08   0.17 
DET      0.42   0.09   0.25   0.67   0.33   0.51 
ENTR     0.76   0.20   0.44   1.34   0.57   0.96 


This could possibly mean that the BL pairs collaborated better 
than BH and M pairs, given their higher incidence of recurrent 
fixations. However, it could also mean that the high RR values found 
in BL pairs were due to the BL pairs’ greater number of 
fixation points, implying that the BL pairs had spent more time 
comprehending the program flow and finding the errors in the 
program. More time spent could have resulted in more chances of 
having recurrent fixations. BH and M pairs exhibited the 
same degree of collaboration based on their comparable average 
RR values, with M only slightly higher than BH. It can also be noted 
that the high RR values observed in all categories were all found in the 
middle programs, possibly indicating that the middle programs 
required more concentration than the other programs. 


Figure 3. Boxplots of RR in All Categories 


The BL pairs only had average to high DET values, whereas BH 
and M pairs both only had low to average DET values. See Table 
1 for high and low DET values. Figure 4 shows the boxplots of 
DET in all categories. The greater number of high DET values 
found in BL pairs could possibly mean that the BL pairs 
shared more identical scanpaths than BH and M pairs. 
Also, the BL pairs had more occurrences of high RR values and 
seemed to have spent longer on the task, which might have 
resulted in more matching scanpaths compared to BH and M 
pairs. As with RR, the average DET values of BH and M pairs were nearly the 
same, indicating the same degree of collaboration as assessed 
through their percentage of identical scanpaths. 


Upon examination of their L values, results showed that the BL pairs 
had neither high nor low L values. All but two of their L values 
were below the mean. The M pairs had few occurrences of high L 
values, whereas BH pairs had one incidence each of high and low 
L values. Hence, a large majority of the L values were average. 
See Table 1 for high and low L values. Figure 5 shows the 
boxplots of the L values in all categories. These results imply 
that all of the pairs, regardless of their expertise level or prior 
knowledge, concentrated and exerted more or less the same 
level of attunement on the given task. However, the M pairs 
possibly exhibited more frequent longer durations of 
attunement than BH and BL pairs. BL pairs, on the other 
hand, exhibited more frequent shorter durations of attunement. 


Figure 4. Boxplots of DET in All Categories 


Figure 5. Boxplots of L in All Categories 


As for LMAX, BL pairs seemed to have exhibited better stability 
in terms of eye coordination, particularly in the middle programs, 
since they had more occurrences of high LMAX values. M pairs 
seemed to have struggled with eye coordination the most because 
of more incidences of low LMAX values. However, the average 
LMAX values of BH and M pairs were comparable, possibly 
indicating that the BH pairs’ eye coordination stability was almost 
the same as that of M pairs. See Table 1 for high and low LMAX values. 
Figure 6 shows the boxplots of LMAX in all categories. 

The same pattern in DET can also be observed in ENTR in terms 
of the incidences of high and low ENTR. The BL pairs had 
average to high ENTR values, whereas both BH and M pairs only 
had low to average ENTR, with M pairs having more low ENTR 
values than the BH pairs. See Table 1 for high and low ENTR 
values. Figure 7 shows the boxplots of ENTR in all categories. 
These findings imply that the BL pairs seemed to have more 
complex scanpaths in looking for bugs than BH and M 
pairs, particularly in the middle programs. The BH pairs had the 
least complicated and, hence, more predictable scanpaths, but their 
average ENTR was comparable to the M pairs’ average ENTR, 
indicating that their scanpaths when looking for bugs were almost 
identical. 


[Figure 6 boxplots: LMAX (vertical axis) by CATEGORY (BH, BL, M).] 


Figure 6. Boxplots of LMAX in All Categories 


Figure 7. Boxplots of ENTR in All Categories 


As with DET and ENTR, the BL pairs only had average to high 
LAM values, whereas both BH and M pairs only had low to 
average LAM values. See Table 1 for high and low LAM values 
and Figure 8 for the boxplots. This could imply that the BL pairs 
seemed to have encountered more problems in understanding the 
program and, hence, tended to spend more time in certain regions 
of the code. BH and M pairs, on the other hand, seemed to have 
struggled less in understanding and debugging the programs. 


[Figure 8 boxplots: LAM (vertical axis) by CATEGORY (BH, BL, M).] 


Figure 8. Boxplots of LAM in All Categories 


We also examined the number of slide switches between the 
program specification and the buggy program. We observed that 
the BL pairs had the lowest average number of slide switches 
among the pairs, but the highest LAM values. This could 
mean that BL pairs tended to spend more time focusing on the 
actual program looking for bugs and switched less frequently 
between the program specification and the buggy program 
than the other categories. BH and M pairs had a higher 
frequency of slide switches but the lowest LAM values. BH 
and M pairs probably switched between slides more frequently 
because they read the program specification only to quickly check 
and recheck what the program does and were fast at 
inspecting what was wrong in the actual program. BL pairs 
probably did not pay much attention to the program specification and 
focused on locating bugs in the actual program for most 
of the task. 


Overall, it can be noted that for all the pairs, more evidence of 
collaboration and concentration appeared in the middle part of 
the task. Perhaps all the pairs perceived the middle programs as the 
most difficult to debug. 


4. SUMMARY AND CONCLUSION 


The goal of this paper was to characterize the collaboration 
between pairs of novice programmers in the act of tracing and 
debugging a program in an attempt to understand the 
collaborative relationship of two individuals on a given task. 
Their collaboration was assessed through their CRQA metric 
results. 


Findings showed that BL pairs are characterized by high RR, 
high DET, high ENTR and high LAM. Their high RR and DET 
signify that BL pairs are inclined to collaborate with their peers 
more than BH and M pairs. However, their high ENTR 
may signify complicated scanpaths in looking for bugs, and their 
high LAM implies a tendency to stay in the same regions of the code, 
which further implies that they frequently have difficulties in 
understanding and debugging the programs. 


All pairs, regardless of category, tend to exhibit the same level of 
attunement in debugging, as evident in their L values. The M 
pairs, however, are characterized as having more incidences of 
low LMAX values, which could mean that they tend to struggle with 
eye coordination the most. Overall, BH and M pairs are 
comparable in terms of collaboration as assessed through their 
CRQA results. We hypothesize, therefore, that the presence of a 
participant with high prior knowledge in M pairs may have 
contributed to the similarity between BH and M pairs. 


ABSTRACT 


This study takes a novel approach toward understanding success 
in a math course by examining the linguistic features and affect of 
students’ language production within a blended (with both on-line 
and traditional face-to-face instruction) undergraduate course 
(n=158) on discrete mathematics. Three linear effects models 
were compared: (a) a baseline linear model including non-linguistic 
fixed effects, (b) a model including only linguistic 
factors, and (c) a model including both linguistic and non-linguistic 
effects. The best model (c) explained 16% of the variance of final 
course scores, revealing significant effects for one non-linguistic 
feature (days on the system) and two linguistic features (Number 
of dependents per prepositional object nominal and Sentence 
linking connectives). One non-linguistic factor (Is a peer tutor) 
and two linguistic variables (Words related to self and Words 
related to tool use) demonstrated marginal significance. The 
findings indicate that language proficiency is strongly linked to 
math performance such that more complex syntactic structures 
and fewer explicit cohesion devices equate to higher course 
performance. The linguistic model also indicated that less 
self-centered students and students using words related to tool use 
were more successful. In addition, the results indicate that 
students who are more active in on-line discussion forums are 
more likely to be successful. 


Keywords 


NLP, math, student success, on-line learning 


1. INTRODUCTION 


Cognitive skills are crucial for student success in the math 
classroom. While research has primarily focused on skills that 
strongly overlap with math knowledge including spatial attention 
and quantitative ability [1], cognitive skills supporting math 
success such as language ability remain under-researched. At the 
same time, a number of researchers have argued that language 
skills are a prerequisite for transferring cognitive operations 
between math and language domains and that lower language 
skills can present critical obstacles in math reasoning. 


Prior research has examined links between language skills and 
math success to examine the premise that students with greater 
language abilities are better able to engage with math concepts 
and problems. This research is based on the notion that success in 
the math classroom can be partially explained through language 
skills that allow students to constructively participate in math 
discussions as well as to quantitatively engage with math 
problems [2]. Similarly, math literacy is thought to be not just 
knowledge of numbers and symbols, but also knowledge of
language to understand the discourse of math (i.e., the words
surrounding the numbers and symbols) [3].


Despite research that links language skills to math success in the 
classroom, a major methodological problem in previous studies is 
the reliance on correlational analyses among standardized tests of 
math and linguistic knowledge. For instance, several studies have 
looked at the links between tests of language proficiency (e.g., 
syntax, knowledge, verbal ability, and phonological skills) and 
success on tests of math knowledge (e.g., algebraic notation,
procedural arithmetic, and arithmetic word problems [4, 5]). Other
studies have compared success on standardized math tests
between native speakers of English and second language speakers 
of English with lower linguistic ability [6, 7]. While a few studies 
have focused on the perceived linguistic complexity of math 
problems in standardized tests [8, 9], the majority of studies have 
not analyzed the actual language produced by students and the 
relationship between language complexity and success on math 
assessments (see [10] for an exception). 


This study builds on the work of Crossley et al. [10] and examines 
links between the complexity of language produced by students in
an on-line question/answer forum in a blended math class and their
success in the course. To do so, we examine students’ forum posts 
within the on-line tools used in the class for a number of linguistic 
features related to text cohesion, lexical sophistication, syntactic 
complexity, and sentiment derived from natural language 
processing (NLP) tools. The goal of this study is to examine the 
extent to which the linguistic features produced by students are 
predictive of their final scores in a blended discrete mathematics 
course. In addition to the linguistic features, we also examined a 
number of non-linguistic factors that are potentially predictive of 
math success including: whether the student was a peer tutor, 
class section (of two sections), and on-line forum behavior 
including: how many times they viewed posts, how many posts 
they made, how many questions they asked, how many answers 
they provided, and how many days they visited the on-line class 
forum. 


1.1 Language and Math Relationships 

Prior studies have investigated the connections between language 
proficiency and math skills in native speakers (NS) of English. 
These studies generally demonstrate strong links between math 
ability and language ability. For instance, Macgregor and Price [5] 
found that students who scored high on an algebra test also scored 
well on language tests. A follow-up study using a more difficult 
algebra test found a stronger relationship between algebraic 
notation and language ability. Similarly, Vukovic and Lesaux [4] 
reported links between language and math skills, but found that the
language skills differed in their degree of relation with math
knowledge. For example, general verbal ability was indirectly
related through symbolic number skills while phonological skills 
were directly related to arithmetic knowledge. Other research has 
focused on the indirect links between math and language skills. 
For example, Hernandez [11] analyzed students’ scores from the 
reading and math sections of a standardized test and found 
significant positive correlations between reading ability and math 
achievement. These findings led Hernandez to recommend that 
students’ reading skills and strategy training should be factored 
into math instruction in order to increase effectiveness, especially 
for poor readers. However, not all studies have found strong links 
between math knowledge and language skills. For instance, 
LeFevre et al. [1] reported that linguistic skills were related to 
number naming, that quantitative abilities were related to 
processing numerical magnitudes, and that spatial attention was 
related to a variety of numerical and math tests. However,
non-linguistic features such as quantitative abilities and spatial
attention were stronger predictors of math ability. 


In terms of language production, only one study, to our 
knowledge, has examined the links between the language 
produced by students and their success in the math classroom. 
Crossley et al. [10] examined linguistic and non-linguistic features 
of elementary student discourse while students were engaged in 
collaborative problem solving within an on-line math tutoring 
system. Student speech was transcribed and NLP tools were used 
to extract linguistic information related to text cohesion and 
lexical sophistication. They examined links between the linguistic 
features and pretest and posttest math performance scores as well 
as links with a number of non-linguistic factors including gender, 
age, grade, school, and content focus (procedural versus 
conceptual). Their results indicated that non-linguistic factors are 
not predictive of math scores but that linguistic features related to 
cohesion, affect, and lexical proficiency explained around 30% of 
the variance in students' math scores. Specifically, higher scoring 
students produced more cohesive texts that were more 
linguistically sophisticated. 


1.2 Current Study


A number of studies have demonstrated strong links between 
students’ linguistic knowledge, affect, and their success in math. 
Studies examining these links have traditionally relied on 
correlational analyses between linguistic knowledge tests and 
standardized math tests [1, 3, 4]. In this study, we take a novel 
approach and examine the linguistic features and affect of 
students’ language production in a blended math class with both 
on-line and traditional face to face instruction. To derive our 
variables of interest, we analyzed the linguistic and affective 
features produced by the students in their forum postings using a 
number of NLP tools. These tools extract information related to 
text cohesion, lexical sophistication, syntactic complexity, and 
sentiment. In contrast to most prior studies (see [10] for an 
exception), our interest is not on linguistic performance as 
measured by standardized tests, but on linguistic performance as a 
function of language production as found in students’ forum posts. 


Our criterion variable is students’ final score in the
semester-long blended math class. In addition to examining relations
between linguistic features of student language production and 
math scores, we also examined a number of non-linguistic factors 
including: whether the student was a peer tutor; how many times 
they viewed posts in the on-line forum; how many posts they 
made in the on-line forum; how many answers they provided in 
the on-line forum; how many questions they asked in the on-line 
forum; how many days they visited the on-line forum; and class
section (there were two sections). Thus, in this study, we
addressed two research questions: 


1. Are non-linguistic factors significant predictors of math 
performance in a blended math class? 

2. Are linguistic factors related to lexical sophistication, 
cohesion, syntactic complexity, and affect significant 
predictors of math performance in a blended math class? 


2. METHOD 
2.1 The Blended Math Class: Discrete Math 


Discrete Mathematics is an undergraduate math course offered by 
the computer science department at North Carolina State 
University. Students in the course are provided instruction on the 
mathematical tools and abstractions that are integral to a general 
CS education, including logic, truth tables, set theory, graphs, 
counting, induction, recursion, and functions. Students majoring 
in CS must complete the course with a grade of C or better in 
order to remain in their degree program. The course includes 10 
homework assignments, 5 lab assignments, 3 midterms, and a 
final exam. 


The discrete math course studied is a blended course. In addition 
to the standard lecture and office hours, students are supported by 
a range of on-line tools. These include a Piazza question/answer 
forum, on-line homework assignments through WebAssign, and 
two labs that are Intelligent Tutoring Systems for logic and 
probability. Our focus in this analysis is the Piazza data. Piazza is 
a standard question-answering forum. Students, teaching 
assistants (TAs), and instructors are allowed to post questions or 
topic prompts as well as general polls. The members of the class 
may then respond to these posts with replies and sub-replies. They 
may also choose to recommend both posts and replies as being 
particularly informative by clicking on a “good question” or
“good answer” button. Question responses are classified in Piazza. 
The instructors and TAs may post an official "instructor 
response". If that is done, then these are flagged separately from 
student replies. Individuals may edit their replies over time in 
response to users' comments. While Piazza may be configured to 
permit anonymous posting by students, this function was turned 
off by default in this course. In addition to the basic thread 
structure, Piazza requires that posts be categorized by topic and it 
keeps a running list of threads and supports basic search to help 
students locate relevant information. 


We study data from the Fall 2013 semester of this course. During 
that semester, the class was divided into two sections with two 
primary instructors, five teaching assistants, and 250 students. In 
addition to the instructor and official graduate TAs, the course 
was supported by a set of peer tutors. These are high-performing 
students in the course who are given extra credit for acting as 
mentors. During the Fall 2013 semester, 32 students volunteered 
to act as peer tutors and roughly 1/3 of them completed the 
required 10 hours to receive extra credit. 


For the purposes of our analysis, we collected Piazza data 
recording the students' interactions once the course was complete. 
This data included how many times students viewed posts in the 
Piazza forum, how many posts students made in the Piazza forum, 
how many answers students provided in the Piazza forum, how 
many questions students asked in the Piazza forum, and how 
many days students visited the Piazza forum. 




2.2 Forum Posts 


We selected forum posts because they provide students with a 
platform to exchange ideas, discuss lectures, ask questions about 
the course, and seek technical help, all of which lead to the 
production of language in a natural setting. Such natural language 
can provide researchers with a window into individual student 
motivation, linguistic skills, writing strategies, and affective
states. This information can in turn be used to develop models to 
improve students’ learning experiences [12]. 


Students in the course were given access to the Piazza forum at 
the start of the class. Students were encouraged to use Piazza (not 
email) for course communications by posting their questions to 
the forum outside of class, and answering questions posed by their 
peers. The TAs and peer tutors were required to check the forum 
regularly with the goal of ensuring an average response time of 15 
minutes per post, and that no single question would "go stale" by 
being left for more than 2 hours without a reply. In addition to 
basic question/reply Piazza interactions, the instructor and TAs 
posted regular announcements and general comments to the 
forum, making it the primary vehicle for non-lecture 
communication in the course. 


Student posts were retrieved from a Piazza database that was 
extracted at the end of the course. The student posts were 
segmented out to eliminate duplicate content as well as 
unnecessary markup. Of the 250 students in the course, 169 made 
posts on the forum. For the 169 students who made a forum post, 
we aggregated each of their posts such that each post became a 
paragraph in a text file. We selected only those students who 
produced at least 50 words in their aggregated posts (n = 158). We 
selected a cut off of 50 words in order to have sufficient linguistic 
information to reliably assess the student’s language using NLP 
tools. 
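

As a rough illustration of the aggregation and length filter described above, the following Python sketch joins each student's posts into one text (one post per paragraph) and keeps only texts with at least 50 words. The input format (pairs of student ID and post text), the function names, and the whitespace-based word count are assumptions made for this example; the preprocessing scripts actually used in the study are not described here.

```python
from collections import defaultdict

MIN_WORDS = 50  # cut-off reported in the text


def aggregate_posts(posts):
    """Join each student's posts into one text, one post per paragraph.

    `posts` is an iterable of (student_id, post_text) pairs; this input
    shape is an assumption made for the example.
    """
    by_student = defaultdict(list)
    for student_id, text in posts:
        cleaned = " ".join(text.split())  # collapse runs of whitespace
        # Skip empty posts and exact duplicates from the same student.
        if cleaned and cleaned not in by_student[student_id]:
            by_student[student_id].append(cleaned)
    return {sid: "\n\n".join(texts) for sid, texts in by_student.items()}


def select_long_enough(aggregated, min_words=MIN_WORDS):
    """Keep only students whose aggregated text has at least `min_words` words."""
    return {
        sid: text
        for sid, text in aggregated.items()
        if len(text.split()) >= min_words
    }
```

Counting words by splitting on whitespace is the simplest choice; a tokenizer-based count could give slightly different totals near the 50-word boundary.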


2.3 Natural Language Processing Tools

We used several NLP tools to assess the linguistic features in the 
aggregated posts of sufficient length. These included the Tool for
the Automatic Analysis of Lexical Sophistication (TAALES) [13], 
the Tool for the Automatic Analysis of Cohesion (TAACO) [14], 
the Tool for the Automatic Analysis of Syntactic Sophistication 
and Complexity (TAASSC) [15], and the SEntiment ANalysis and 
Cognition Engine (SEANCE) [16]. The selected tools reported on
language features related to lexical sophistication, text cohesion,
syntactic complexity, and sentiment, respectively. The tools are discussed in
greater detail below. 


2.3.1 TAALES 


TAALES incorporates about 150 indices related to basic lexical 
information (e.g., the number of tokens and types), lexical 
frequency, lexical range, psycholinguistic word information (e.g., 
concreteness, meaningfulness), and academic language for both 
single word and multi-word units (e.g., bigrams and trigrams). 


2.3.2 TAACO


TAACO incorporates over 150 classic and recently developed 
indices related to text cohesion. For a number of indices, the tool 
incorporates a part of speech (POS) tagger and synonym sets from 
the WordNet lexical database [17]. TAACO provides linguistic 
counts for both sentence and paragraph markers of cohesion and 
incorporates WordNet synonym sets. Specifically, TAACO 
calculates type token ratio (TTR) indices, sentence overlap indices 
that assess local cohesion, paragraph overlap indices that assess 
global cohesion, and a variety of connective indices such as
logical connectives (e.g., moreover, nevertheless) and sentence
linking connectives (e.g., nonetheless, therefore, however).
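

To make two of these index families concrete, the sketch below computes a type token ratio and a connective incidence score for a short text. It is not TAACO's implementation: the tokenizer, the three-word connective list (taken from the examples above), and the normalization by token count are all assumptions made for illustration.

```python
import re

# Illustrative list built from the examples in the text; not TAACO's word list.
SENTENCE_LINKING = {"nonetheless", "therefore", "however"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def connective_incidence(tokens, connectives=SENTENCE_LINKING):
    """Proportion of tokens that appear in the connective list."""
    if not tokens:
        return 0.0
    return sum(token in connectives for token in tokens) / len(tokens)
```

For the text "However, the proof holds. Therefore the proof is valid." this yields 9 tokens and 7 types, so the type token ratio is 7/9 and the connective incidence is 2/9.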


2.3.3 TAASSC 


TAASSC measures large- and fine-grained clausal and phrasal
indices of syntactic complexity and usage-based
frequency/contingency indices of syntactic sophistication.
TAASSC includes 14 indices measured by Lu’s [18] Syntactic
Complexity Analyzer (SCA), 31 fine-grained indices of clausal
complexity, 132 fine-grained indices of phrasal complexity, and 
190 usage-based indices of syntactic sophistication. The SCA 
measures are classic measures of syntax based on t-unit analyses 
[19]. The fine-grained clausal indices calculate the average 
number of particular structures per clause and dependents per 
clause. The fine-grained phrasal indices measure 7 noun phrase 
types and 10 phrasal dependent types. The syntactic sophistication 
indices are grounded in usage-based theories of language 
acquisition [Ellis, 2002] and measure the frequency, type token 
ratio, attested items, and association strengths for verb-argument 
constructions (VACs) in a text. 


2.3.4 SEANCE 


SEANCE is a sentiment analysis tool that relies on a number of 
pre-existing sentiment, social positioning, and cognition
dictionaries. SEANCE contains a number of pre-developed word 
vectors developed to measure sentiment, cognition, and social 
order. These vectors are taken from freely available source 
databases. For many of these vectors, SEANCE also provides a 
negation feature (i.e., a contextual valence shifter) that ignores
positive terms that are negated (e.g., not happy). SEANCE also 
includes a part of speech (POS) tagger. 


2.4 Statistical Analysis 


We calculated linear models to assess the degree to which 
linguistic features in the students’ language output along with 
other fixed effects (e.g., question/note posted, questions answered, 
site visits) were predictive of students’ final math scores. Prior to 
linear model analysis, we first checked that the linguistic variables 
were normally distributed. We also controlled for
multicollinearity between all the linguistic and non-linguistic
variables (r > .900) such that if two or more variables were highly
collinear, all but one of the variables were removed from the
analysis. We used R [21] for our statistical analysis. Final model
selection and interpretation were based on t and p values for fixed
effects and visual inspection of residuals distribution for non-
standardized variables. To obtain a measure of effect sizes, we
computed correlations between fitted and observed values,
resulting in an overall R² value for the fixed factors. We
developed and compared three models: (a) a baseline linear model 
including non-linguistic fixed effects, (b) a model including only 
linguistic factors, (c) a model including both linguistic and non- 
linguistic effects. We compared the strength of each model using 
Analyses of Variance (ANOVAs) to examine which models were 
most predictive. 
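

The analysis itself was run in R. As a language-neutral illustration of two of the steps described above, the Python sketch below prunes variables whose pairwise correlation exceeds .900 and computes R² as the squared correlation between fitted and observed values. The function names and the rule of keeping the first variable in each collinear group are assumptions; the text does not say which variable was retained.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def prune_collinear(features, threshold=0.900):
    """Return variable names after dropping highly collinear ones.

    `features` maps a variable name to its list of values. A variable is
    kept only if |r| with every already-kept variable is at or below the
    threshold (keep-first rule, an assumption of this sketch).
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson_r(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def r_squared(fitted, observed):
    """Effect size as the squared correlation of fitted and observed values."""
    return pearson_r(fitted, observed) ** 2
```

Because the keep-first rule depends on the order of the variables, a different ordering can retain a different member of a collinear pair.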


3. RESULTS 
3.1 Non-linguistic Linear Model 


A linear model considering all non-linguistic fixed effects
revealed significant effects for whether the student was a tutor or 
not and number of days spent on the Piazza forum. Table 1 
displays the coefficients, standard error, t values, and p values for 
each of the significant non-linguistic fixed effects. The overall 
model was significant, F(3, 154) = 6.116, p < .001, r = .326, R² =
.107. Inspection of residuals suggested the model was not
influenced by heteroscedasticity. The non-linguistic variables
explained around 11% of the variance of the math scores and 
indicated that students who acted as peer tutors and visited the 
system more often received higher overall grades in the class. 


Table 1. Non-linguistic model for predicting math scores 


Fixed Effect          Coefficient   Std. Error   t
(Intercept)           83.988        1.484        56.603***
Is a peer tutor       5.410         1.995        2.712**
Is not a peer tutor   3.340         2.090        1.598
Days on system        0.038         0.012        3.116**


Note: * p < .050, ** p < .010, *** p < .001


3.2 Linguistic Linear Model 

A linear model including linguistic fixed effects revealed 
significant effects for a number of features related to reference to
self, syntactic complexity, reference to tools, and cohesion. Table
2 displays the coefficients, standard error, t values, and p values
for each of the linguistic fixed effects. The overall model was
significant, F(4, 153) = 9.456, p < .001, r = .360, R² = .130.
Inspection of residuals suggested the model was not influenced by
heteroscedasticity. The linguistic variables explained around 13%
of the variance of the math scores and indicated that students who 
referred to themselves less often, used more complex syntax, 
referred to words related to the use of tools, and used fewer 
sentence linking terms received higher final grades in the course. 
An ANOVA comparison between the non-linguistic model and
the linguistic model found a significant difference between the models
(F = 8.120, p < .001), indicating that linguistic features
contributed to a better model fit than non-linguistic features.


Table 2. Linguistic model for predicting math scores 


Fixed Effect                                            Coefficient   Std. Error   t
(Intercept)                                             91.089        3.795        24.002***
Words related to self                                   -67.146       26.024       -2.580*
Number of dependents per prepositional object nominal   6.800         2.478        2.744**
Words related to tools                                  144.097       62.658       2.300*
Sentence linking connectives                            -77.055       33.947       -2.270*


Note: * p < .050, ** p < .010, *** p < .001


3.3 Full Linear Model


A linear model considering non-linguistic and linguistic fixed 
effects revealed significant effects for one of the non-linguistic 
features (days on the system) and two of the linguistic features 
(Number of dependents per prepositional object nominal and 
Sentence linking connectives). One non-linguistic factor (Is a peer
tutor) and two linguistic variables (Words related to self and
Words related to tool use) demonstrated marginal significance.
Table 3 displays the coefficients, standard error, t values, and p
values for each of the fixed effects. The overall model was
significant, F(7, 150) = 9.295, p < .001, r = .399, R² = .159.
Inspection of residuals suggested that the model was not
influenced by heteroscedasticity. The non-linguistic and linguistic
variables explained around 16% of the variance of the math scores 
and followed the same trends as reported in the first two models. 
An ANOVA comparison between the full model and the linguistic 
model found a significant difference between the models (F =
2.790, p < .050), indicating that a combination of non-linguistic
and linguistic features contributed to a better model fit than 
linguistic features alone. 


Table 3. Full model for predicting math scores 


Fixed Effect                                            Coefficient   Std. Error   t
(Intercept)                                             86.564        4.065        21.296***
Is a peer tutor                                         3.840         1.974        1.946
Is not a peer tutor                                     1.516         2.065        0.734
Days on system                                          0.028         0.012        2.213*
Words related to self                                   -44.990       26.876       -1.674
Number of dependents per prepositional object nominal   6.156         2.455        2.507*
Words related to tools                                  120.451       62.545       1.926
Sentence linking connectives                            -72.463       33.644       -2.154*


Note: * p < .050, ** p < .010, *** p < .001


4. DISCUSSION AND CONCLUSION 


Previous research has indicated that language skills are related to 
math success. Much of this research examined links between 
standardized tests of language proficiency and success on tests of 
math knowledge [4, 5] while other research has compared native 
English speakers to second language speakers of English in terms 
of success on standardized math tests [6, 7]. In general, these 
studies have yielded positive relationships between language 
skills and math success. However, the majority of these studies 
did not examine links between the language produced by students 
and math success. A notable exception to this is Crossley et al.’s 
[10] study that used NLP tools to examine links between language 
used in a third-grade math classroom and success on math
assessments. This study reported that linguistic features related to 
cohesion, affect, and lexical proficiency explained around 30% of 
the variance in the math scores. 


In this study, we take a similar approach to Crossley et al. [10] 
and use NLP tools to extract a number of linguistic and sentiment 
features from forum posts found in a blended discrete math 
undergraduate course. We found that a number of non-linguistic 
and linguistic features were strong predictors of math success. For 
instance, peer tutors and students who spent more time on the 
Piazza forums were more likely to be successful in the class. 
Linguistically, students who used fewer words related to self, 
more syntactically complex sentences, more words related to tool 
use, and fewer connectives were also more successful in the class. 
The non-linguistic model explained about 11% of the variance in 
the math scores while the linguistic model explained about 13% of 
the variance. A model that included both non-linguistic and 
linguistic variables explained about 16% of the variance in the 
math scores.


The variance explained by our model was lower than that reported
in Crossley et al. [10]. However, unlike Crossley et al., our
participants were not elementary level students and they were not 
involved in collaborative discourse. Rather, our participants were 
college students and the language samples used in this study came 
from on-line forum posts as compared to natural conversation 
between students in a classroom. These differences likely explain 
the disparities reported between the two studies. For instance, in 
the current study we found a negative correlation between a 
cohesion index (sentence linking connectives) and math scores. 
This may be the result of linguistic development in which young 
children develop text cohesion using explicit markers of cohesion 
while college students use complex syntax to develop cohesive 
text [22, 23]. This distinction suggests that the strong
positive correlation between syntactic complexity and math
success reported in this study arises because more skilled writers
have greater success in the math classroom.


This study also found that a number of different indices than those 
reported by Crossley et al. were predictive of math success. These 
included words related to self, which was negatively associated 
with math success, and words related to tool use, which was 
positively associated with math success. The finding for words 
related to self should likely be interpreted in terms of self-centered 
behavior such that students who were more self-centered were 
likely to be less successful in the math class. This may be a result 
of the collaborative nature of the Piazza forum in which students 
were encouraged to work together to answer questions and solve 
problems. In terms of words related to tool use, the findings likely 
indicate that more successful students used terms that were more 
strongly related to the domain such as computer, equipment, file, 
machine, mechanism, and paper. However, it is notable that 
neither the use of words related to self nor the use of words related to tools was a
significant predictor in the full model that included both linguistic 
and non-linguistic variables. 


In terms of non-linguistic features, this analysis demonstrated that 
two non-linguistic factors were important indicators of math 
success: peer tutoring and days on Piazza. The findings indicate 
that those students who volunteered to peer tutor were more 
successful in the class. In addition, those students who spent a 
greater number of days on the Piazza forum were more successful,
suggesting that engagement in the class discussion forum led to 
greater success. However, only the number of days spent on the 
Piazza forum was a significant predictor in the full model. 


The findings from this study have practical implications for 
understanding math performance in a blended math class at the 
university level. Specifically, the findings provide additional 
support that language proficiency is strongly linked to math 
performance such that more complex syntactic structures and 
fewer explicit cohesion devices equate to higher course 
performance. The linguistic model also indicated that less self- 
centered students and students using words related to tool use 
were more successful. In addition, the results indicate that 
students who are more active in on-line discussion forums are 
more likely to be successful. The study also provides a contrast to 
earlier research [10] in that differences are reported between age
levels (elementary and college level students) and learning 
environments (collaborative discussions and forum posts). Future 
studies can build on these results by continuing to examine 
language features and math success in a number of different 
student populations and learning environments.


ABSTRACT


BKT and other classical student models are designed for binary 
environments where actions are either correct or incorrect. These
models face limitations in open-ended and data-driven environ- 
ments where actions may be correct but non-ideal or where there 
may even be degrees of error. In this paper we present BKT- 
SR and RKT-SR: extensions of the existing BKT model that 
distinguish knowing how to apply a skill from knowing when. 
We compare their relative performance to that of classical BKT 
and PFA on data collected from Deep Thought, an open-ended 
propositional logic tutor. We develop basic performance curves 
for student outcomes to help us visually compare model
predictions to data. We also introduce a new approach for finding
a probability distribution of actions in ranked, multiple option 
environments with RKT and RKT-SR. Our results show that 
knowing when to use skills is more important than how in these 
open-ended contexts. In fact, including the how components 
may hurt performance if implemented naively. Furthermore we 
show that ranked models outperform binary-based models even 
under restrictive assumptions. 


Keywords 
Student-Modeling, Data-Driven Tutoring, Open-Ended Tutors, 
BKT, PFA, Interaction Networks, RKT, RKTSR, BKTSR 


1. INTRODUCTION 


Bayesian Knowledge Tracing (BKT) and other existing learner
models, such as Performance Factors Analysis (PFA), treat each
action as either right or wrong, but in many realistic problem-solving
situations students are not choosing just correct or incorrect actions.
They are choosing from among a range of potential actions some 
of which may be optimal or substantively better than others. 
Thus the classical models are out of sync with the performance 
criteria by which the students are being judged. It also means 
that the models, by design, conflate two distinct skills: knowing 
how to apply a skill (procedural knowledge), and knowing when 
to apply a skill (tactical knowledge). In classical BKT we base
performance on the validity of an individual action, not on its
optimality. Thus students receive points for correctly applying
sub-optimal skills.


*Corresponding author


In this paper we present an extension to BKT, BKT-SR, which 
separates tactical knowledge (recognition of optimal skills) from 
procedural knowledge (correct skill application). This model 
is designed for use in open-ended and data-driven tutorial do- 
mains where students are expected to learn not just how to 
apply individual skills but how to recognize the sequence of skill 
applications that make up an optimal solution. We also present 
a refinement of the existing probability calculations for ranked 
options, and apply these in two new models: RKT and RKT-SR. 
This refinement leads to an improvement in the accuracy of the 
models over existing methods. 


Additionally, in order to investigate which component of BKT- 
SR is most important, we tested the individual components 
(how, when, and some slight variations) on student data. Our 
data is drawn from an open-ended propositional logic tutor
called Deep Thought that is designed for use in discrete math- 
ematics and philosophy. We compare the differing models on 
our data set to demonstrate that knowing when to apply a skill 
is separable from knowing how. 


2. EXISTING MODELS 

BKT and PFA are two of the most successful student model- 
ing approaches. Both are binary action models that predict 
whether a student will take actions that are correct or incorrect 
at any given time given their level of understanding and other 
parameters. In prior head-to-head comparisons the two have 
performed similarly [5]. 


BKT is a simple two-state hidden Markov model (HMM) [3]. It
is based upon five assumptions. Each skill is independent and has 
two states: learned, L, and not learned. Each problem depends 
on exactly one skill, and answers are either correct or incorrect. 
Students can learn, but cannot forget. After an opportunity 
to apply a skill, there is a constant probability to transition, 
T, from unlearned to learned. Students who know a skill will 
answer a problem correctly unless they slip, S, and students 
who don’t know a skill answer incorrectly, unless they guess, G. 


The parameters of BKT are: $L_0$, the initial probability of
knowing a skill; $T$, the probability of transitioning from
unlearned to learned; $G$, the probability of answering a question
correctly when a skill is not learned; and $S$, the probability of
answering a question incorrectly when a skill is learned.
Let $L_i$ be the probability of knowing a skill at step $i$. Then the
probability of answering a problem correctly is calculated as:

$$P(\mathrm{Correct}) = L_i \cdot (1-S) + (1-L_i) \cdot G$$

To update $L$, we first apply Bayes’ theorem, then apply the
transition probability. The reinforcement process has two steps:

$$B_i(\mathrm{Answer}) = \begin{cases} \dfrac{L_i \cdot (1-S)}{L_i \cdot (1-S) + (1-L_i) \cdot G} & \text{Answer is correct} \\[2ex] \dfrac{L_i \cdot S}{L_i \cdot S + (1-L_i) \cdot (1-G)} & \text{Answer is incorrect} \end{cases}$$

$$L_{i+1} = B_i(\mathrm{Answer}) + T \cdot (1 - B_i(\mathrm{Answer}))$$
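
The BKT prediction and two-step update described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function names and the parameter values in the usage lines are assumptions chosen for the example.

```python
def bkt_predict(L, S, G):
    """Probability of a correct answer given the knowledge estimate L."""
    return L * (1 - S) + (1 - L) * G


def bkt_update(L, correct, T, S, G):
    """Bayesian posterior on the observation, then the learning transition."""
    if correct:
        posterior = L * (1 - S) / (L * (1 - S) + (1 - L) * G)
    else:
        posterior = L * S / (L * S + (1 - L) * (1 - G))
    return posterior + T * (1 - posterior)


# Illustrative parameters (assumed, not fitted values from the paper).
L0, T, G, S = 0.3, 0.1, 0.2, 0.1
p_correct = bkt_predict(L0, S, G)
L_after_correct = bkt_update(L0, True, T, S, G)
L_after_incorrect = bkt_update(L0, False, T, S, G)
```

With these assumed values a correct answer raises the knowledge estimate and an incorrect answer lowers it, which is the non-degenerate behavior the model's interpretation requires.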


BKT is time-tested, easily interpreted and implemented, but
fitting BKT parameters is difficult. One difficulty lies in avoiding
degenerate parameters: parameters that cause BKT to behave
counter to its physical interpretation. We avoid degenerate
models using brute force grid search [5].


PFA, by contrast, is a logistic regression model based upon
the skill difficulty ($\beta$), number of successes ($\gamma$), and
number of failures ($\rho$) [11]. PFA has many upsides, not the least
of which is that it can be fit efficiently with general regression
calculations.
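
As a rough sketch of the logistic form behind PFA, the snippet below combines a skill offset with weighted success and failure counts. It assumes the usual single-skill reading in which `beta` is the skill's difficulty offset and `gamma` and `rho` weight the counts of prior successes and failures; the function name and the example values are illustrative only.

```python
import math


def pfa_predict(beta, gamma, rho, successes, failures):
    """Logistic prediction from a skill offset and prior practice counts."""
    m = beta + gamma * successes + rho * failures
    return 1.0 / (1.0 + math.exp(-m))


# Assumed weights: successes help more than failures.
p_early = pfa_predict(-0.5, 0.3, 0.1, successes=0, failures=0)
p_later = pfa_predict(-0.5, 0.3, 0.1, successes=4, failures=1)
```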


3. INTERACTION NETWORKS 


The above models were designed for classical binary problems.
Most realistic problems, however, are more open-ended. Problems
are defined by a goal state and a set of given information to which
problem solvers may apply a range of rules to achieve their goal.
Rather than each action being correct or incorrect, some actions
are correct in a given solution context, and there may be many
ways to solve a problem or many actions to take at a given time,
with some being more efficient than others. The structure of
these open-ended solutions can be efficiently represented in a
data structure called an interaction network. Interaction
networks are directed graphs representing a solution space where
each node is a partial solution state and each edge is a rule
application [4]. Individual solutions are represented as paths in
the network from the start state to a goal state. An interaction
network is the aggregation of all the student solutions for a
problem where each edge is weighted by the number of students
who followed it.
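
The aggregation step can be illustrated with a small sketch. It assumes each solution is a list of `(state, rule, next_state)` steps; the state names and rule labels are hypothetical placeholders, not Deep Thought's actual encoding.

```python
from collections import Counter


def build_interaction_network(solutions):
    """Aggregate student solution paths into a weighted edge map.

    Each solution is a list of (state, rule, next_state) steps. An edge's
    weight is the number of students whose path contains that edge.
    """
    weights = Counter()
    for path in solutions:
        for edge in set(path):  # count each student at most once per edge
            weights[edge] += 1
    return weights


# Two hypothetical students solving the same toy problem.
student_a = [("start", "MP", "s1"), ("s1", "DS", "goal")]
student_b = [("start", "MP", "s1"), ("s1", "Simp", "s2"), ("s2", "DS", "goal")]
network = build_interaction_network([student_a, student_b])
```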


3.1 Value Iteration 

Value iteration is an algorithm for identifying the optimal policy
($\pi$) for use in a Markov Decision Process (MDP) [1]. The core of
the algorithm depends upon an update function that estimates
the current value of a state ($V_{i+1}(s)$) based upon a set reward
($R$), the current values of the neighboring states ($V_i(u_e)$), a
discount factor or cost for taking each action ($\gamma$), and the
probability of taking an action ($P(e)$). In these experiments we use
a constant reward function and a discount factor. Goal states
were assigned a constant value, and the probability of a given
action ($P(e)$) transitioning from state $s$ to $s'$ was estimated
based upon the number of times that it was taken relative to
the total number of steps out of $s$.


For the purposes of our study we defined two forms of the value 
function. The optimistic function assumes that students will 
take the best possible action in a given state and thus the best 
possible route to a goal. The conservative function, by contrast, 
assumes that they will follow the general probability distribution 
of the dataset. Thus: 


Conservative: $V_{i+1}(s) = R + \gamma \cdot \sum_{e \in E_s} P(e) \cdot V_i(u_e)$




Optimistic: $V_{i+1}(s) = \max_{e \in E_s} \left( R + \gamma \cdot P(e) \cdot V_i(u_e) \right)$


The former approach was used in the Hint Factory system 
which uses interaction networks to generate data-driven hints 
[15], while the latter is equivalent to a single option MDP [16]. 
Any iteration that maximizes over contracting functions like 
these is, by definition, a contraction mapping [7]. Thus both 
forms will converge over time to a stable value. 
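
The two update rules can be compared on a toy network with the sketch below. It is not the InVis implementation; the graph encoding, the function name, and the fixed number of sweeps are assumptions. The reward of -1, discount of 0.9, and goal value of 100 are used here only as example settings.

```python
def value_iteration(edges, goals, optimistic, reward=-1.0, gamma=0.9,
                    goal_value=100.0, sweeps=100):
    """edges maps each state to a list of (next_state, count) pairs."""
    states = set(edges) | {u for outs in edges.values() for u, _ in outs}
    V = {s: (goal_value if s in goals else 0.0) for s in states}
    for _ in range(sweeps):
        new_V = dict(V)
        for s, outs in edges.items():
            if s in goals:
                continue  # goal states keep their fixed value
            total = sum(count for _, count in outs)
            # P(e) * V(u_e), with P(e) estimated from edge frequencies
            terms = [(count / total) * V[u] for u, count in outs]
            if optimistic:
                new_V[s] = max(reward + gamma * t for t in terms)
            else:
                new_V[s] = reward + gamma * sum(terms)
        V = new_V
    return V


# Hypothetical network: three students went via "a", one via "b".
toy = {"start": [("a", 3), ("b", 1)], "a": [("goal", 3)], "b": [("goal", 1)]}
conservative = value_iteration(toy, {"goal"}, optimistic=False)
optimistic = value_iteration(toy, {"goal"}, optimistic=True)
```

In this toy case the optimistic form scores the start state by its single most frequent outgoing edge, while the conservative form averages over both.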


4. OUR EXTENSIONS 


We built several different extensions to the existing BKT model 
that are designed to take advantage of extra information in 
the interaction network to separate tactical knowledge (when to 
apply a skill) from procedural knowledge (how to apply a skill). 


4.1 BKT-SR (BKT Skill Recognition) 

BKT Skill Recognition (BKT-SR) is a semi-binary model that
predicts students’ behavior on a binary basis but reinforces on
a more complex paired basis. In it we maintain two separate BKT
models for each skill: one tracks procedural knowledge, $BKT_{How}$,
and the other tracks tactical knowledge, $BKT_{When}$. BKT-SR
assumes that the ideal skill will be used only if the student
correctly recognizes how to apply it, and knows that it is ideal.


The probability of answering a question correctly is the
probability given by $BKT_{How}$ multiplied by that given by
$BKT_{When}$. The difference between the two models lies in their
reinforcement. $BKT_{How}$ reinforces the skill component of the
action used, positively if it was used correctly. $BKT_{When}$
reinforces the skill component of both the action used AND the
ideal action, positively if they are the same, negatively otherwise.
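
One way to read the BKT-SR prediction and reinforcement rules is sketched below. The skill names, the shared parameter values, and the choice to store each model as a dictionary of per-skill estimates are assumptions made for illustration; the paper fits the how and when models independently.

```python
def predict(L, S, G):
    """Standard BKT probability of a correct response."""
    return L * (1 - S) + (1 - L) * G


def update(L, positive, T, S, G):
    """Standard BKT posterior followed by the learning transition."""
    if positive:
        post = L * (1 - S) / (L * (1 - S) + (1 - L) * G)
    else:
        post = L * S / (L * S + (1 - L) * (1 - G))
    return post + T * (1 - post)


def bkt_sr_predict(how, when, ideal, S, G):
    """Chance of the ideal action: the student knows how AND knows when."""
    return predict(how[ideal], S, G) * predict(when[ideal], S, G)


def bkt_sr_reinforce(how, when, used, ideal, used_correctly, T, S, G):
    """How model: the skill used. When model: both used and ideal skills."""
    how[used] = update(how[used], used_correctly, T, S, G)
    same = used == ideal
    for skill in {used, ideal}:
        when[skill] = update(when[skill], same, T, S, G)


# Hypothetical skills and parameters; both models share T, S, G here.
T, S, G = 0.1, 0.1, 0.2
how = {"MP": 0.5, "DS": 0.5}
when = {"MP": 0.5, "DS": 0.5}
p_ideal = bkt_sr_predict(how, when, "MP", S, G)
bkt_sr_reinforce(how, when, used="DS", ideal="MP", used_correctly=True,
                 T=T, S=S, G=G)
```

After this step the how estimate for the skill that was used correctly rises, while the when estimates for both the used and the ideal skill fall because they differ.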


4.2 RKT (Ranked Knowledge Tracing) 


Our environment is not binary: there are almost always several
‘correct’ options of ranked quality for each state. We therefore
introduce the ranked models, RKT and RKT-SR. These
models introduce a technique to give a probability distribution
over a set of ranked options from simpler single skill models.
The underlying model and reinforcement technique of RKT
and RKT-SR is similar to BKT; however, it can be replaced by
other comparable models so long as the reinforcement process
is modified appropriately. This approach gives us a rigorous
way to aggregate simple learner model predictions into a valid
probability distribution over all actions. Conceptually, RKT
tries the best option; if that fails it tries the second best, if that
fails it tries the third, and so on, wrapping back to the first.


Let $x$ be our current model state and let $\alpha_i(x)$ be the
probability that a student can use the skill required for option $i$
given state $x$. Assuming that the $n$ skill options for a problem
are given in order, the probability of using the $i$th action is

$$p_i(x) = \frac{\alpha_i(x) \prod_{j=1}^{i-1} (1 - \alpha_j(x))}{1 - \prod_{j=1}^{n} (1 - \alpha_j(x))}$$
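
The "try the best option, then the next, wrapping back to the first" idea corresponds to a geometric series over full passes through the options. The sketch below computes the resulting distribution from a list of per-option success probabilities; the function name and the example values are assumptions.

```python
from math import prod


def rkt_distribution(alpha):
    """Probability of ending on each ranked option, best option first.

    alpha[i] is the chance that the student can use option i's skill.
    A full pass in which every option fails restarts from the first option,
    which contributes the 1 / (1 - miss_all) factor.
    """
    miss_all = prod(1 - a for a in alpha)
    probs, miss_before = [], 1.0
    for a in alpha:
        probs.append(a * miss_before / (1 - miss_all))
        miss_before *= 1 - a
    return probs


ranked = rkt_distribution([0.5, 0.5])
```

Because the numerators sum to one minus the probability that every option fails, the result is a valid distribution over the options.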


RKT’s underlying model uses a simple two state Hidden-Markov
Model (HMM) for each skill. State $x$ is a vector of knowledge
confidence, while $\alpha_i(x)$ is defined by taking the $i$th
component as $L$ and then calculating the probability as in standard
BKT. RKT’s update function is inspired by Bayes’ theorem but differs
slightly as our probability function is not linear. An exact, naive
implementation of an HMM would require that we sum over every
combination of skill knowledge, which is prohibitively expensive.


To illustrate the update algorithm, suppose skill $k$ is applied in
state $x$, that $x_j$ is the probability of knowing skill $j$, and
that $u_j$ is $x$ with the $j$th skill set to 1. We then calculate
the new value for skill $j$, $y_j$, as:

$$y_j = \frac{p_k(u_j) \cdot x_j}{p_k(x)}$$

After each update we apply our transition function only to the
ideal skill model. This function is applied in the same way as
in BKT. Here $p_i$ is convex in each argument, so our update will
keep $L$ between 0 and 1. Further, it will increase $L$ iff knowing
the skill will increase the chance of the given action. Thus the
update is consistent and in the appropriate direction.
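
A Bayes-style update of this kind, in which each skill's estimate is rescaled by how much knowing that skill would change the probability of the observed action, can be sketched as follows. The sketch assumes one skill per option, options indexed in rank order, and shared slip and guess values; these simplifications and all names are illustrative, and the transition step is omitted.

```python
from math import prod


def rkt_distribution(alpha):
    """Ranked-option distribution: try options in order, wrapping around."""
    miss_all = prod(1 - a for a in alpha)
    probs, miss_before = [], 1.0
    for a in alpha:
        probs.append(a * miss_before / (1 - miss_all))
        miss_before *= 1 - a
    return probs


def rkt_update(x, k, S, G):
    """Return y with y[j] = p_k(u_j) * x[j] / p_k(x) for observed option k."""
    def p_k(state):
        # Per-option success chance computed from L as in standard BKT.
        alpha = [L * (1 - S) + (1 - L) * G for L in state]
        return rkt_distribution(alpha)[k]

    base = p_k(x)
    y = []
    for j, x_j in enumerate(x):
        u_j = list(x)
        u_j[j] = 1.0  # condition on skill j being known
        y.append(p_k(u_j) * x_j / base)
    return y


x = [0.5, 0.5]
y = rkt_update(x, k=0, S=0.1, G=0.2)
```

In this example, observing the top-ranked option raises the estimate for its skill and lowers the estimate for the other skill, since knowing the other skill would have made the observed choice less likely.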


4.3 RKT-SR (RKT Skill Recognition)

Like BKT-SR, RKT-SR tries to separate procedural and tactical
knowledge using two parallel RKTs, one for how and one for
when. As in RKT, for state $x$, let $\alpha_i(x)$ denote confidence
of being able to apply the skill used in option $i$, and $\beta_i(x)$
denote confidence of being able to identify when to use the skill of
option $i$.


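In the RKT-SR approach we model the student’s process as
first noticing a set of options (how skill). Then, of the noticed
options, they rank them (when skill). Finally, they select
the highest-ranked action to the best of their ability. Thus the
probability of doing action $i$ is:

$$p_i(x) = \sum_{\{S \,:\, i \in S \subseteq [n]\}} \left( \prod_{j \in S} \alpha_j(x) \prod_{j \in [n] \setminus S} (1 - \alpha_j(x)) \right) \frac{\beta_i(x) \prod_{j<i,\, j \in S} (1 - \beta_j(x))}{1 - \prod_{j \in S} (1 - \beta_j(x))}$$

This simplifies to:

$$p_i(x) = \sum_{t=0}^{\infty} \alpha_i(x) \beta_i(x) (1 - \beta_i(x))^{t} \prod_{k<i} \left( \alpha_k(x) (1 - \beta_k(x))^{t+1} + 1 - \alpha_k(x) \right) \prod_{k>i} \left( \alpha_k(x) (1 - \beta_k(x))^{t} + 1 - \alpha_k(x) \right)$$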


Assuming that each $\beta$ is bounded away from 1 and 0, we can
approximate the infinite sum by taking a fixed number of terms,
then normalizing it. For the sake of efficiency, we limit the
number of terms to 3. We believe that RKT-SR has a convex
probability function like RKT. Thus we update it similarly, with
how and when updated independently.
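
The truncate-then-normalize approximation can be sketched as below. The sketch assumes the series form in which term $t$ weights option $i$ by $\alpha_i \beta_i (1-\beta_i)^t$, multiplied by a factor $\alpha_k (1-\beta_k)^{t+1} + 1 - \alpha_k$ for each better-ranked option and $\alpha_k (1-\beta_k)^{t} + 1 - \alpha_k$ for each worse-ranked one. The function name and the default of three terms are illustrative.

```python
from math import prod


def rkt_sr_distribution(alpha, beta, terms=3):
    """Truncated, renormalized RKT-SR distribution over ranked options.

    alpha[i]: confidence the student can apply option i's skill (how).
    beta[i]: confidence the student recognizes when to use it (when).
    """
    n = len(alpha)
    raw = []
    for i in range(n):
        total = 0.0
        for t in range(terms):
            before = prod(alpha[k] * (1 - beta[k]) ** (t + 1) + 1 - alpha[k]
                          for k in range(i))
            after = prod(alpha[k] * (1 - beta[k]) ** t + 1 - alpha[k]
                         for k in range(i + 1, n))
            total += alpha[i] * beta[i] * (1 - beta[i]) ** t * before * after
        raw.append(total)
    norm = sum(raw)
    return [r / norm for r in raw]


approx = rkt_sr_distribution([0.8, 0.6], [0.7, 0.5])
# With every alpha equal to 1 and many terms, the ranking part alone remains.
ranking_only = rkt_sr_distribution([1.0, 1.0], [0.5, 0.5], terms=60)
```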


Note that setting all $\alpha_i = 1$ in this model yields RKT, as does
setting all $\beta_i = 1$. Thus RKT does not necessarily claim that
either tactical or procedural knowledge is more important, since
modelling either one with the assumption that the other is trivial
yields the same model.


5. DATA SET 


For this analysis we collected data from two semesters of an 
undergraduate Discrete Mathematics course at NCSU where
Deep Thought is used. This dataset includes 4 class sections, 205
students, 2322 problem attempts, and 28640 individual steps. 
Unfortunately the data includes several cases where individual 
events were not logged such as the student entering or exiting the 
program, and cases where events were logged out of order due to 
network issues. While we cleaned these up as much as possible, 
we still include 913 errors in our data that we could detect but 
could not fix. While this missing data may contain important 
information, the average student only had a few such errors, even 
though 148 of the students had some kind of error in their logs. 


In open-ended tutors like Deep Thought, problem-solving errors
(i.e. incorrect applications) are often treated in one of two ways.
Either the system records the mistake but leaves it onscreen and
does not permit it to hinder forward progress, or the system
forces the student to fix or retract it immediately. In effect the
latter forces the user to always step back to their prior state before
moving on. Deep Thought adopts this latter approach.
Consequently it is possible to ignore user mistakes in our dataset or
to recognize them explicitly. With that in mind we tested our
models with two different interaction networks. One network ignored
self-loops, thus ignoring mistakes, and the other included them.


For each state, we ranked the set of derived statements to obtain
a canonical order. Thus the states are dependent only on what
was derived, not how or when it was derived.


5.1 Deep Thought 


Deep Thought is an intelligent tutoring system for propositional
logic. Deep Thought has been continually improved with hints
[15], worked examples [10], and proficiency profiling [9]. The
system’s assessments have been verified against student test
scores [8]. Deep Thought uses a GUI to guide students through
6 problem levels with increasing difficulty. Problems in Deep
Thought are presented as a set of logical assumptions, and
a statement which the student must derive from them by
applying axioms of propositional logic.


6. METHODOLOGY 


We first generated the networks using all of the student data. 
This ensured that all actions taken by the students were in- 
cluded in the graph thus simplifying our analysis. This was not 
expected to bias in favor of any model. For the modeling step 
we only calculated the error and reinforced the models based 
upon steps with multiple correct options. 


We used InVis to produce the graphs and perform the value 
iteration [12]. We fixed the value of our goal states at 100, used a 
negative immediate reward for each action of -1, and a discount 
factor of 0.9. Every other state started with a value of 0. 


When measuring error, we focus on the cases where the system
predicts that a student will take the ideal action. We use
a running average as our baseline. For the present we are more
interested in the relative performance of our models than their
absolute performance against chance.


In many states there are two distinct ideal actions that lead to
different states with the same value. In this case, we want to
know if a student completed either one. To get the appropriate
probability of an ideal action we calculate the individual
probabilities of the two ideal actions and, assuming they are
independent, we then return the probability that either one is
performed. This approach works for simpler models like BKT
and PFA which return per-action probabilities. However, it
may be unfairly penalizing RKT and RKT-SR, which return a
complete probability distribution.


We tested our models using 10-fold cross validation. Each model
was fit using an exhaustive grid search minimizing RMSE. Final
metrics were found by calculating the RMSE and AUC for each
fold, and then averaging them.
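
For reference, the two per-fold metrics can be computed as in the sketch below. It uses the pairwise (rank-based) definition of AUC with ties counted as one half; the helper names and the toy predictions are assumptions, and a real evaluation would apply these to each fold's held-out steps before averaging.

```python
import math


def rmse(preds, labels):
    """Root mean squared error between predictions and 0/1 outcomes."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, labels))
                     / len(labels))


def auc(preds, labels):
    """Fraction of (positive, negative) pairs ranked correctly."""
    pos = [p for p, y in zip(preds, labels) if y == 1]
    neg = [p for p, y in zip(preds, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy fold: predicted chance of the ideal action vs. what happened.
fold_preds = [0.9, 0.8, 0.3, 0.2]
fold_labels = [1, 0, 1, 0]
fold_rmse = rmse(fold_preds, fold_labels)
fold_auc = auc(fold_preds, fold_labels)
```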


6.1 Applying Binary Models 

BKT and PFA are not designed to handle non-ideal solutions,
thus their models do not tell us how to reinforce them in this
case. For each skill, we can reinforce the underlying knowledge
component of the skill positively (reward), or negatively
(punishment). Thus each model is seen as a black box, where we
"select" skills to reinforce, and reward or punish them
appropriately. In this context we can reinforce the skills that the
student actually performed as well as the ideal skills, which they
may not have performed. Here we tested four different versions of
BKT which differ in what skills are selected for punishment and which
are selected for reward.


Stock-BKT: This focuses solely on the students’ demonstrated
skills, ignoring idealness. It selects the skill used and rewards it
if the action is correct.

ActualSkill-BKT: This focuses on the students’ demonstrated
skills, but with only the best possible action considered correct.
It selects the skill used and rewards it if it is ideal.

IdealApp-BKT: This focuses on whether or not the student knows
which action is ideal and penalizes them for anything else. It
selects the ideal skill and rewards it if it was used ideally. The
model makes no change if they performed a correct, but non-ideal,
use of the skill, and it punishes otherwise.

IdealActual-BKT: This attempts to model both using a joint
probability distribution. Thus it explicitly conflates knowing when
to do something and knowing how, and then sets a standard of
correctness consistent with that. It selects both the ideal and the
applied skills. If the ideal skill is used it is rewarded, otherwise
both are punished.
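
The four selection rules can be summarized as a single function that returns which skills to reinforce and in which direction. This is one reading of the descriptions above, not the authors' code; the argument names are assumptions, with `correct` meaning the step was a valid rule application and `is_ideal` meaning the step taken was the ideal action.

```python
def select_reinforcement(variant, used, ideal, correct, is_ideal):
    """Return a list of (skill, positive) reinforcement decisions."""
    if variant == "Stock":
        return [(used, correct)]
    if variant == "ActualSkill":
        return [(used, is_ideal)]
    if variant == "IdealApp":
        if is_ideal:
            return [(ideal, True)]
        if correct and used == ideal:
            return []  # correct but non-ideal use of the ideal skill
        return [(ideal, False)]
    if variant == "IdealActual":
        if is_ideal:
            return [(ideal, True)]
        # Punish both skills, listing a shared skill only once.
        return [(skill, False) for skill in dict.fromkeys([ideal, used])]
    raise ValueError(f"unknown variant: {variant}")


# Hypothetical step: the student correctly applied DS, but MP was ideal.
decisions = {v: select_reinforcement(v, "DS", "MP", True, False)
             for v in ("Stock", "ActualSkill", "IdealApp", "IdealActual")}
```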


We chose to reinforce PFA and the running average using the 
same selection model as in ActualSkill-BKT. For reference, 
BKT-SR is equivalent to IdealActual-BKT times Stock-BKT, 
reinforced independently. 


6.2 Plotting Performance 

In order to quickly check for skill acquisition, we developed 
a visualization technique. For each student, we look at the 
opportunities that they had to apply a skill ideally, and whether 
they actually used it. We then plotted these frequencies for all 
students on a single scatter plot. 


Specifically, for each student $x$ and each skill $k$, we make a
vector $k^x$, where the length of $k^x$ is the number of times
skill $k$ was ideal, with $k^x_i = 1$ if the student used the ideal
option the $i$th time $k$ was ideal, and 0 otherwise. Let $n_x(i)$ be
the set of skills that were ideal at least $i$ times. Define $v^x$ as


$$v^x_i = \frac{1}{|n_x(i)|} \sum_{k \in n_x(i)} k^x_i$$
Then we just plot each $v^x$ together on a scatter graph. For
comparison purposes we simulated data using BKT and plotted
it using this technique. In the simulated plot, you can see a clear
trend. This trend is not clearly visible in our real data set. While
some tweaking of the parameters in the simulated data shows slower
learning, the plots still show learning. Even graphs with errors look
almost identical to the ones shown irrespective of value iteration
algorithm. Thus this technique, while interesting, is ill-suited
to detect learning in this domain.


Figure 2: Simulated Performance


6.3 Model Fitting 


We fit our parameters using exhaustive grid search. Grid search
often compares favorably with other fitting methods like EM
[14]. We define our grid by specifying the upper bound, the
lower bound, and the number of equal length steps between
them for each parameter. We chose the parameter bounds so
that no fit would be degenerate [17]. BKT-SR used the same
parameters to fit both the when and how subskills, but fits them
independently to save time. The same holds for RKT-SR.
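
A grid of this kind, defined per parameter by a lower bound, an upper bound, and a number of equal steps, can be searched exhaustively as in the sketch below. The loss function in the usage lines is a stand-in; in the setting described here it would be the RMSE of a model fit with the candidate parameters. All names and bounds are illustrative.

```python
from itertools import product


def grid(lower, upper, steps):
    """Evenly spaced values from lower to upper, inclusive."""
    width = (upper - lower) / steps
    return [lower + i * width for i in range(steps + 1)]


def grid_search(loss, bounds):
    """bounds maps each parameter name to (lower, upper, steps)."""
    names = list(bounds)
    axes = [grid(*bounds[name]) for name in names]
    best = min(product(*axes),
               key=lambda point: loss(dict(zip(names, point))))
    return dict(zip(names, best))


# Stand-in loss with a known minimum at T = 0.2, G = 0.1.
def toy_loss(params):
    return (params["T"] - 0.2) ** 2 + (params["G"] - 0.1) ** 2


best_params = grid_search(toy_loss, {"T": (0.0, 0.4, 4), "G": (0.0, 0.3, 3)})
```

Keeping the upper bounds on guess and slip well below 0.5 is one common way to rule out degenerate fits, though the exact bounds used in the paper are not given in this section.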


We chose the resolution for our grid search model in these cases 
to guarantee a similar amount of time per search, around 2


Table 1: Model Fitting Results


fo Conservative 
Model 
Prwse’ "ave | ruse auc || ruse auc | rwse avc_| 


Optimistic 


No Err Err 


RMSE RMSE 
0.438547 


0.442861 
0.489387 


0.451457 
0.454968 
0.493906 


0.696120 
0.690093 
0.664096 


Average 
PFA 
Stock-BKT 
ActualSkill-BKT 
IdealApp-BKT 
IdealActual-BKT 
BKT-SR 
RKT 
RKT-SR 


0.446208 
0.438583 
0.436518 
0.469820 
0.437331 


0.676102 
0.699546 
0.697695 
0.691284 
0.737032 


0.458204 
0.452686 
0.449347 
0.452071 
0.450763 


0.724183 
0.440841 0.739516 | 0.432296 0.729586 


Conservative 

No Err Err 
RMSE 

0.465104 


0.469697 
0.492487 


RMSE 
0.446898 
0.451412 
0.495561 


0.674632 
0.661166 
0.663387 


0.690875 0.667558 
0.681035 | 0.660922 
0.647382 0.633865 
0.646841 
0.686899 
0.684025 
0.628389 
0.704591 
0.713305 


0.454614 
0.448043 
0.444161 
0.479585 
0.447027 
0.438965 


0.656281 
0.681597 
0.682124 
0.671495 
0.709409 
0.715869 


0.471135 
0.465627 
0.462758 
0.465264 
0.464668 
0.455561 


0.671619 
0.709654 
0.704180 
0.650012 


Table 2: KT Fitting Parameters


Table 3: Baseline Fitting Parameters


Running Avg PFA 
Prior seat sles 


hours, save for RKT-SR, which takes about 5 times as long as 
RKT to run, and takes 10 times as long to fit using our grid 
search. We determined that lowering the resolution any more 
would make fitting ineffective. We expect that the real running 
time could be greatly improved through code tweaks and by 
using a more efficient implementation language. 


7. RESULTS 


The results of the optimistic and conservative value iteration are 
largely equivalent, with every model predicting a little better 
on the optimistic value iteration, including the running aver- 
age. This is likely because the optimistic value iteration favors 
the most frequently used options more than conservative value 
iteration. 


Stock-BKT, the standard how BKT, performed worse than
any other model across the board. This implies that tactical
knowledge is more important than procedural knowledge in this
domain. Surprisingly, removing all error observations does not
change the performance of Stock-BKT relative to the other
models.


ActualSkill-BKT does slightly worse than a running average, as
does PFA, but IdealApp-BKT, which reinforces the ideal skill
alone, performs better, trading blows with the running average.
This suggests that using the wrong skill is more an indication
that the right skill is not known, rather than that the used skill
is unknown. Ultimately it appears that they are more important
together; this is supported by the fact that IdealActual-BKT
outperforms both the other models and the running average.


BKT-SR does not perform as well as its when sub-component,
IdealActual-BKT. In fact, when we include errors in our data set,
BKT-SR does significantly worse. The fact that including errors
did not help Stock-BKT or BKT-SR was a surprise. This seems
to suggest that failing to use a skill correctly does not always stem
from not knowing that skill. We suggest that this is actually just
noise from random guesses. When looking at individual records,
we find that this is consistent with what we have seen in the logs.
There we find long stretches where students solve problems in
order followed by bursts of failed skill applications. Thus the
extra noise in the how component of BKT-SR hurts the model.


But if we compare the more informed models, RKT and RKT-SR,
we get a better picture. RKT-SR is the best performing
model across the board with RKT second in terms of AUC,
and IdealActual-BKT second in RMSE. RKT and RKT-SR
incorporate more than just the ideal option: their predictions
incorporate all of the other skills into the probabilities. Thus
in BKT terms, the guess and slip are not constant, and they
depend upon the other options and upon how good the student
is with them. In line with this, RKT and RKT-SR reinforce
every applicable skill, not just a few.


Both RKT and RKT-SR assume that the options are ordered;
the conceptual difference is that RKT does not distinguish
between procedural and tactical knowledge. That is enough
to outdo all our other models (except RKT-SR) in terms of
AUC. Unlike our simpler models, incorporating both how and
when information further improves performance, as RKT-SR
outperforms RKT. So when and how are both different and
useful concepts, but separating them takes a little more effort
than BKT-SR.


8. CONCLUSIONS & FUTURE WORK 


Open-ended tutoring systems are designed to teach students
not only how to apply a skill but when to do so. Classical
student modeling approaches, however, have focused entirely on
procedural knowledge and generally ignored tactical information.
In practice it is often difficult to assess whether or not students
are gaining this tactical knowledge and prior studies have either
assumed it or have been content to conflate the two.


In this paper we address this lack of information in two ways. 
First we sought to visually inspect improvements in tactical 
knowledge. We found that for real student data there is no 
clear or statistically significant indication of improvement. We 
therefore opted to develop novel student models that incorporate 
this information and then to assess their performance on real 
student data. 


In future work we plan to enhance the structure of both our 
experimental and baseline models. Since this project started, 
there have been a number of interesting extensions to BKT, 
such as adding forgetting, and latent student abilities [6]. We 
did not implement these extensions, but they should be directly 
applicable to this context, as well as to RKT and RKT-SR. 


Additionally, Deep Thought originally implemented interaction
networks for the purposes of hint generation [15]. Later
improvements saw worked examples incorporated into it [10]. This
significantly affected student behaviour. Since none of our
models integrate contextual data, we restricted our data to the
students that saw no worked examples. In future, we may
modify the update for the model to incorporate the worked
examples. This integration of contextual information has been
done before [13], but in this case it is probably more accurate
to apply a transition probability.


Many interactive tutors have solutions that can be expressed as 
an interaction network and thus can be used with these methods. 
These include Andes [18], and Pyrenees [2]. We will seek to 
generalize these results by testing them on datasets collected 
from these tools. 


RKT and RKT-SR are new models which make strong assump- 
tions. In future work we will reevaluate the behavior of these 
models and the underlying assumptions that they make. RKT, 
for example, assumes that quality is ranked, but removing that 
assumption could change the model significantly. 


RKT gives a valid probability distribution over all options, but
we have only tested its accuracy in predicting whether the ideal
action is used. We did not test whether or not it was accurate at
predicting which of the other actions would be used. This is
believed to be an advantage of RKT, but we have not verified that.


ABSTRACT


In this paper we present a novel, data-driven algorithm for 
generating feedback for students on open-ended program- 
ming problems. The feedback goes beyond next-step hints, 
annotating a student’s whole program with suggested edits, 
including code that should be moved or reordered. We also 
build on existing work to design a methodology for evalu- 
ating this feedback in comparison to human tutor feedback, 
using a dataset of real student help requests. Our results 
suggest that our algorithm is capable of reproducing ideal 
human tutor edits almost as frequently as another human tu- 
tor. However, our algorithm also suggests many edits that 
are not supported by human tutors, indicating the need for 
better feedback selection. 


1. INTRODUCTION AND BACKGROUND
A hallmark of Intelligent Tutoring Systems (ITSs) is their 
ability to support learners with adaptive feedback as they 
work on problem solving tasks. In the domain of open-ended 
computer programming, much research has addressed how 
this feedback can be generated automatically using reference 
solutions [11] or data-driven methods [5, 6, 9]. However, 
existing techniques (including our own work [6]) have two 
notable limitations: the type of feedback they can provide 
and the methods with which they are evaluated. 


Existing work has focused almost exclusively on generating 
next-step hints, suggesting how a student can proceed if they 
get stuck. Next-step hints make sense in the context of a 
structured problem-solving task, with well-defined, discrete 
steps, but they may not always be appropriate in an open- 
ended programming context. Students may request help for 
other reasons, such as to verify that code they have written 
is correct, or to help find a bug in code that does not produce 
correct output. A more comprehensive feedback generation 
algorithm is needed to address these concerns. In this work, 
we present SourceCheck, a novel feedback generation algo- 
rithm that builds on existing work to check over a student’s 
whole program, suggesting useful edits throughout. 


While extensive effort has been put into the generation of 
feedback for programming, efforts to evaluate the quality of 
this feedback are still underdeveloped. Most existing eval- 
uations are either technical evaluations that focus on how 
often hints can be generated and theoretical hint quality 
(e.g. [6, 9, 11]) or small classroom studies that use case 
studies (e.g. [7]). Ideally, we would employ controlled stud- 
ies to evaluate the impact of feedback on students’ course
outcomes, as was done by Stamper et al. in their evaluation
of data-driven hints in the Deep Thought logic tutor [10]. 
However, recent work suggests that programming hints can 
vary widely in quality and that low-quality hints may deter 
students from later asking for help when they need it [8]. 
A meaningful first step would therefore be to better under- 
stand and evaluate the quality of the feedback we generate. 
Piech et al. [5] suggest evaluating automatically generated 
hints for programming by comparing them to “gold stan- 
dard,” expert-authored hints. We build on this method to 
evaluate our feedback algorithm, comparing it to human- 
authored feedback. 


Our initial results show that SourceCheck’s feedback has 
good overlap with that from human tutors. However, 
SourceCheck also produces much more feedback than hu- 
man tutors, and much of this feedback is not represented 
in human tutor feedback. This suggests that SourceCheck 
has good potential but that more work is needed to select 
targeted feedback from potential suggestions. 


2. FEEDBACK GENERATION 


At a high level, SourceCheck works on a simple premise. To 
generate feedback for a student on a given assignment, we 
use a two-step process. First, in the Solution Matching step, 
we look at previously submitted, correct student solutions 
for that assignment and select the one that best matches 
that student’s code. Then, in the Edit Inference step, we 
extract the edits that separate the student’s code from the 
correct solution and present these as feedback. This idea 
dates back to the original Hint Factory [1] and was success- 
fully implemented by Rivers and Koedinger for program- 
ming hints [9]. Rather than changing the fundamentals of 
this idea, we present techniques for improving both steps of 
the process. These improvements center on the understand- 
ing that students’ solutions are diverse and often include 
much correct code that does not directly match a known so- 
lution because of small changes in structure. SourceCheck 
attempts to make use of this code, and can suggest moving 
code in addition to inserting and deleting it. 


SourceCheck takes as input a set of complete, correct prior 
student solutions for an assignment and a snapshot of code 
from a new student requesting a hint. As in previous work, 
we represent both as an abstract syntax tree (AST), a di- 
rected, rooted tree where each node is labeled to represent 
a program element, such as a function call, control struc- 
ture or variable, and the hierarchy of the tree represents
how these elements are nested together. To each AST we
apply simple canonicalization to reduce syntactic complexity
while preserving semantic meaning, as described in [6].
SourceCheck outputs a set of edits (insertions, deletions,
moves and reorders) that can be used to annotate the
student’s code with feedback. While this feedback can include
next-step hints in the form of insertions, it also highlights
potential errors and provides reassurance that unannotated
code is likely correct.


2.1 Solution Matching 


Most hint generation algorithms for programming select a 
goal solution by finding the “closest” solution to the stu- 
dent’s current code, determined by some distance metric. 
Researchers have used string edit distance [9] and approx- 
imations of tree edit distance [11], though more complex 
metrics have been proposed [4]. The problem with edit dis- 
tances, however, is that they heavily penalize differences in 
the position of code fragments [3, 4]. For example, swap- 
ping the order of two independent subroutines in a program 
does not affect its semantic meaning, but this movement 
is treated as a large set of deletions and insertions by edit 
distance algorithms. 


Mokbel et al. suggest addressing this by fragmenting each 
AST into subgraphs, pairing similar subgraphs from the two 
ASTs, and computing their distance independently [3]. We 
build on this idea, along with our previous work decompos- 
ing ASTs using root paths [6], to produce a distance metric 
designed specifically for code. The root path of a node n in 
an AST is the sequence of node labels on the path from the 
root of the AST to n. Multiple nodes in an AST will have 
the same root path if they and each of their respective an- 
cestors have matching labels, such as two calls to the same 
function in the same block of code. 
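
Root paths are straightforward to compute with a recursive walk. The sketch below represents an AST node as a `(label, children)` tuple, which is an assumption made for illustration rather than SourceCheck's own data structure; the node labels are likewise hypothetical.

```python
from collections import defaultdict


def root_paths(tree, prefix=()):
    """Yield (root_path, node) for every node in a (label, children) tree."""
    label, children = tree
    path = prefix + (label,)
    yield path, tree
    for child in children:
        yield from root_paths(child, path)


def group_by_root_path(tree):
    """Map each root path to the list of nodes that share it."""
    groups = defaultdict(list)
    for path, node in root_paths(tree):
        groups[path].append(node)
    return groups


# Hypothetical program: two calls to the same function in one block.
ast = ("script", [("call:move", []),
                  ("call:move", []),
                  ("if", [("call:turn", [])])])
groups = group_by_root_path(ast)
```

Here the two `call:move` nodes share the root path `("script", "call:move")`, so they would be candidates for matching against the same group of nodes in a solution.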


Given ASTs $A$ and $B$, consisting of nodes $\{a_1, \ldots, a_{|A|}\}$
and $\{b_1, \ldots, b_{|B|}\}$ respectively, SourceCheck produces a
matching, $M = \{[a_i, b_j], \ldots\}$, pairing nodes from $A$ to nodes
from $B$, and a cost $C$ for the mapping. Nodes can only appear
in one pair, and some nodes may be left unmatched. First,
we iterate over each root path in $A$, from shortest to longest
path. For a given root path $r$, let $A_r$ and $B_r$ be the set
of nodes in $A$ and $B$ respectively with root path $r$. Let us
define $c(n)$ as the child-sequence of $n$, or the sequence of
node labels of the immediate children of $n$. For each pair of
nodes $a_{ri} \in A_r$ and $b_{rj} \in B_r$, we compute the pairwise
distance between their child-sequences, $d(c(a_{ri}), c(b_{rj}))$. This
is used to match nodes with the same root path and similar
children.


For the distance function $d$, we could use a string edit distance,
such as Levenshtein distance, since AST child-sequences are just
sequences of node labels. However, SourceCheck is designed to match
incomplete student code ($A$) to complete solutions ($B$), so for $d$ we
use a “progress” function that measures how much of $c(a_{ri})$
represents progress towards $c(b_{rj})$. Our progress function is
similar to an edit distance, but it is intentionally asymmetrical and
penalizes deletions (student’s code not found in the solution) much
more than insertions (solution code not yet found in the student’s
code). Additionally, our progress function identifies
insertion/deletion pairs with the same label and treats these as a
“reorder”, which has a much lower cost, distinguishing between code
that should be deleted and code that is out of order.
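The exact form of the progress function is not given here, so the following is only a sketch of one asymmetric cost with the properties described above. The weights, the function names, and the use of a longest common subsequence to detect out-of-order labels are all assumptions made for illustration.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two label sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def progress_cost(student, solution, w_delete=3.0, w_insert=1.0, w_reorder=0.5):
    """Asymmetric cost between a student and a solution child-sequence.

    The three weights are illustrative assumptions, not values from the paper.
    """
    shared = sum((Counter(student) & Counter(solution)).values())
    in_order = lcs_length(student, solution)
    deletions = len(student) - shared    # student labels absent from the solution
    insertions = len(solution) - shared  # solution labels the student still lacks
    reorders = shared - in_order         # shared labels that appear out of order
    return w_delete * deletions + w_insert * insertions + w_reorder * reorders


missing_code = progress_cost(["a", "b"], ["a", "b", "c"])  # one insertion needed
extra_code = progress_cost(["a", "b", "c"], ["a", "b"])    # one deletion needed
out_of_order = progress_cost(["b", "a"], ["a", "b"])       # one reorder needed
```

Under these assumed weights, code that is merely out of order costs less than code that is missing, which in turn costs less than code that does not belong in the solution.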


SourceCheck calculates the pairwise distances for all nodes in $A_r$
and $B_r$ and then uses the Hungarian algorithm to select the set of
pairs of minimum total cost, which it adds to the mapping, $M$. This
cost is added to $C$. This procedure is performed for each root path
in $A$ to determine the total mapping and cost. To select a target
solution $T$ for a student’s current code $S$, SourceCheck simply finds
the solution with the minimum mapping cost. The result is a target
solution that maximizes the number of nodes in the student’s code
which can be reasonably mapped to nodes in the target solution.
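The pairing step for a single root path can be sketched as a minimum-cost assignment. To keep the example dependency-free, an exhaustive search stands in for the Hungarian algorithm; it finds the same optimum but is only practical for small groups of nodes. The cost values and the name `min_cost_pairs` are hypothetical.

```python
from itertools import permutations


def min_cost_pairs(cost):
    """Return (pairs, total) of minimum total cost.

    cost[i][j] is the distance between student node i and solution node j.
    Every node on the smaller side is paired; the rest stay unmatched.
    """
    n_rows = len(cost)
    n_cols = len(cost[0]) if cost else 0
    best_pairs, best_total = [], float("inf")
    if n_rows <= n_cols:
        for cols in permutations(range(n_cols), n_rows):
            total = sum(cost[i][j] for i, j in enumerate(cols))
            if total < best_total:
                best_pairs, best_total = list(enumerate(cols)), total
    else:
        for rows in permutations(range(n_rows), n_cols):
            total = sum(cost[i][j] for j, i in enumerate(rows))
            if total < best_total:
                best_pairs = sorted((i, j) for j, i in enumerate(rows))
                best_total = total
    return best_pairs, best_total


# Hypothetical distances: two student nodes (rows), three solution nodes (columns).
costs = [[4.0, 1.0, 3.0],
         [2.0, 0.5, 5.0]]
pairs, total = min_cost_pairs(costs)
```

Summing `total` over all root paths gives a mapping cost per candidate solution, and the candidate with the smallest sum would be chosen as the target.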


2.2 Edit Inference 


Once a target solution $T$ has been identified for a student’s code $S$,
SourceCheck identifies a set of edits that can bring the student
closer to this solution. In previous work, this is accomplished by
selecting the top-level applicable edit [9] or following edits from
previous students [6]. Instead, we use the mapping $M$ between the
student’s AST and the target solution to calculate a more precise set
of edits between $S$ and $T$. These edits take the form of Moves and
Reorders, along with traditional Insertions and Deletions, determined
as follows:


Deletions: First, all nodes $s \in S$ without a pair in $M$ are marked
for deletion; however, these nodes may be reused in the final step of
the algorithm.


Moves: Next, we consider all pairs $[s_i, t_i] \in M$. Let $P(n)$ denote
the parent of the node $n$ in its AST. If $[P(s_i), P(t_i)] \notin M$,
this means $s_i$ is under the wrong parent node, so we mark $s_i$ to be
moved under $p$, where $[p, P(t_i)] \in M$, at an index corresponding to
that of $t_i$. If no such $p$ exists, this means that the appropriate
parent has not yet been added to $S$. We still mark $s_i$ for movement,
but we cannot specify a destination.


Reorders: Next, we ensure that the children of $s_i$ are in the correct
order. We do this by identifying the set of matching child pairs
$[c_s, c_t] \in M$ such that $P(c_s) = s_i$ and $P(c_t) = t_i$. For each
node $c_s$, if the node’s index among its siblings is different than
that of its pair, $c_t$, we mark it for reordering.


Insertions: Any node $t \in T$ which has no pair in $S$ is marked for
insertion. If $P(t)$ has a pair in $S$, this pair is used as a parent.
If $P(t)$ has no pair in $S$, we do not yet have a place to insert $t$.
We still mark $t$ for insertion, since it may be useful in the next
step.


Combining Insertions and Deletions: If a node is 
deleted in one place and a node with the same label is in- 
serted in another, this may actually represent a Move or 
Reorder. We identify pairs of Deletions and Insertions with 
the same label and replace these with an appropriate Move 
or Reorder. This is a key feature of SourceCheck that en- 
courages a student to use existing code, rather than deleting 
and re-inserting it. 


Using the mapping M, SourceCheck is able to infer more 
semantically meaningful edits, such as Moves and Reorders, 
which convey more information than their component insertions
and deletions would alone. The Deletions that remain
indicate likely errors to the students, and the Moves and 
Reorders suggest areas in need of editing. Any node not 
marked with an edit has been “checked” and likely repre- 
sents correct code. 
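The Deletion, Move and Insertion rules can be sketched as follows, assuming each AST is given as a dictionary from node id to parent id and the matching as a dictionary from student node id to target node id. Reorders and the final step of combining Insertions with Deletions are omitted for brevity, and all names and the toy data are hypothetical.

```python
def infer_edits(student_parent, target_parent, mapping):
    """Derive Deletions, Moves and Insertions from a node mapping.

    student_parent / target_parent: node id -> parent id (None for the root).
    mapping: student node id -> paired target node id.
    """
    reverse = {t: s for s, t in mapping.items()}
    # Deletions: student nodes without a pair.
    deletions = [s for s in student_parent if s not in mapping]
    moves = []
    for s, t in mapping.items():
        sp, tp = student_parent[s], target_parent[t]
        if sp is None and tp is None:
            continue  # both nodes are roots, nothing to check
        parents_paired = sp is not None and tp is not None and mapping.get(sp) == tp
        if not parents_paired:
            # Destination is the student node paired with the target's parent,
            # or None when that parent has not been added to the student's code.
            moves.append((s, reverse.get(tp)))
    # Insertions: target nodes without a pair, with their parent's pair if any.
    insertions = [(t, reverse.get(target_parent[t]))
                  for t in target_parent if t not in reverse]
    return deletions, moves, insertions


# Hypothetical ASTs: the student placed s2 beside the loop s1 instead of inside it.
student_parent = {"s0": None, "s1": "s0", "s2": "s0", "s3": "s0"}
target_parent = {"t0": None, "t1": "t0", "t2": "t1", "t3": "t1"}
mapping = {"s0": "t0", "s1": "t1", "s2": "t2"}
deletions, moves, insertions = infer_edits(student_parent, target_parent, mapping)
```

In this toy case `s3` is marked for deletion, `s2` is marked to move under `s1`, and `t3` is marked for insertion under `s1`.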


3. METHODS 


Our evaluation focuses on measuring the quality and appropriateness
of SourceCheck’s feedback by comparing it to human tutor feedback.
This is in contrast to previous technical evaluations [6, 9] that
used theoretical measures of hint quality and availability. Instead,
we extend the work of Piech et
al., who assessed feedback quality by comparing hint poli- 
cies with “gold standard,” human-authored, expert hints on 
small, constrained programming problems (4-6 lines of code 
for an ideal solution) [5]. However, for the more complex 
problems we investigate here (about twice as many lines of 
code), we argue that it is not realistic to define a single best 
“gold standard” hint for a given code snapshot. There may
be many useful ways a tutor can advise a student, so it is 
more reasonable to measure the similarity of human and 
algorithmic feedback, rather than whether they match ex- 
actly. We build on the “gold standard” method to compare 
the feedback of human tutors and SourceCheck in a more 
nuanced way. We focus on the following research questions: 


RQ1: How well does SourceCheck’s feedback agree with 
ideal human tutor feedback? 


RQ2: How does the agreement between SourceCheck and 
a human tutor compare to the agreement between human 
tutors? 


We evaluated SourceCheck in the context of an introduc- 
tory computing course for non-majors, consisting of 51 stu- 
dents, held at a research university during the Spring 2017 
semester. During the first half of the course, undergradu- 
ate teaching assistants (TAs) facilitated Snap! programming 
labs derived from the Beauty and Joy of Computing (BJC) 
AP Computer Science Principles curriculum [2] (available 
at bjc.edc.org). The course includes three in-lab programming
assignments, completed with TA help available, interleaved with
three homework assignments, completed independently. Students
programmed using iSnap¹ [7], an extension of the block-based,
novice programming environment
Snap! [2]. iSnap supports students working on open-ended 
assignments with data-driven, on-demand hints [6]. 


We selected one homework assignment (Squiral — SQ) and 
one in-lab assignment (The Guessing Game — GG) for anal- 
ysis. In SQ, students draw a square-shaped spiral using 
loops, variables and a custom block (function), and a typ- 
ical solution is around 10 lines of code. In GG, students 
create a simple game in which the player must guess a ran- 
dom number using loops, variables, conditionals and user 
input, and a typical solution is around 13 lines of code. We 
built a dataset of student hint requests on GG and SQ to 
serve as authentic scenarios for evaluating SourceCheck. We 
sampled up to two hint requests from each student. Where
possible we sampled one request from the first half of their
working time and one from the second half to avoid overly
similar samples. We also ensured that at least 30 seconds
and one code edit occurred between sampled hint requests.
We sampled hints from 14 and 15 students on SQ and GG
for a total of 22 and 29 hints respectively, 51 altogether.

¹ Demo and datasets available at http://go.ncsu.edu/isnap


3.1 Human Feedback Generation 

For each hint request, we extracted a snapshot of the stu- 
dent’s code at the time of the request. Importantly, these 
snapshots represent code for which real students requested 
help, making them an ideal sample on which to evaluate 
SourceCheck. We did so using a post hoc Wizard-of-Oz- 
style experiment. The first two authors, who were familiar 
with the assignments and context, acted as human tutors 
and manually generated feedback for each selected snapshot. 
The two tutors were graduate students in Computer Science 
who were domain experts but not teaching experts, making 
them similar to most course TAs for advanced computing 
courses. 


When generating feedback, the human tutors attempted to 
offer pedagogically useful feedback, but they were limited to 
communicating their feedback using the edits defined in Sec- 
tion 2.2. These edits also had to be independent, meaning 
no edit could be dependent on the student following another 
suggested edit (e.g. suggesting inserting both a for-loop 
and the body of the loop). Tutors crafted these edits with 
the understanding that the edits would be (theoretically) 
presented to students without further explanation or any 
guarantee of further feedback requests. These limitations 
forced the human tutors to generate feedback that could be 
provided through the same user interface that SourceCheck 
would use, as in a Wizard-of-Oz experiment, allowing us to 
directly compare human and algorithmic feedback. Tutors 
generated their feedback based on the student’s current code 
at the time of the hint request, using previous snapshots of 
the student’s code for context. However, tutors did not have 
access to a student’s code after the hint request or the stu- 
dent’s final solution. While the two tutors generated feed- 
back independently, they first practiced on a dataset with 
the same assignments from another semester and compared 
results to ensure a consistent understanding of the feedback 
guidelines. The tutors generated feedback in a two-phase 
process: 


Phase I: Tutors identified the edit(s) they would recom- 
mend to best support the student’s current goal and pro- 
mote learning. The edit(s) should convey a single idea. 


Phase II: Tutors envisioned a correct solution that most
closely matched the student’s current code and identified all 
edits that would bring the student closer to this solution. 


Phase I allows us to measure how well the algorithm repro- 
duces ideal, targeted tutor feedback, addressing RQ1. In 
Phase II, tutors generate a large set of all applicable edits, 
just as SourceCheck does, allowing us to directly compare 
algorithmic and human feedback, addressing RQ2. 


4. ANALYSIS AND RESULTS 


Figure 1: The percentage of tutor Phase I edits predicted by SourceCheck (SC) and TC (the recall) for each
edit type on the GG and SQ assignments. Bars are labeled with the total number of Phase I edits of each
type (bottom) and the number of correctly labeled edits (top).

To quantify the overlap between two sets of feedback for a
given snapshot, we define feedback generation as the process
of labeling each node of an AST with an edit (Delete, Reorder,
Move or nothing) and generating a set of Insertions
(each consisting of a type of node to insert and an index
in the AST at which to insert it). Under this definition,
we can treat SourceCheck as a classifier and evaluate its 
ability to predict the feedback provided by tutors. We mea- 
sure classification success for each type of edit separately, 
treating it as a binary classification task. For Deletions and 
Moves, we consider each node of each snapshot in our dataset 
to be a classification instance, where both the tutors and 
SourceCheck have labeled the node. Successful classifica- 
tion occurs when SourceCheck produces the same label as 
the tutor. Each Insertion provided by either the tutors or 
SourceCheck for a given snapshot is also considered a clas- 
sification instance, where both the tutors and SourceCheck 
have either included or not included that Insertion. Since 
Reorders were rarely suggested by SourceCheck and were 
never suggested by tutors, we exclude them from analysis. 


Treating feedback generation as a classification task allows 
us to address RQ1 and evaluate the extent to which Source- 
Check agrees with (predicts) the feedback of human tutors. 
The results of this evaluation would be difficult to interpret 
without a baseline for comparison. Therefore, we also de- 
fine a “Tutor Classifier” (TC), which predicts feedback from 
Tutor 1 using the feedback collected in Phase II from Tutor 
2, and vice versa. Since tutors generate a full set of applica- 
ble edits in Phase II, just like SourceCheck, we can directly 
compare the SourceCheck and TC classifiers. This allows us 
to address RQ2, comparing the agreement of human and al- 
gorithmic feedback with that of two humans. We would not 
generally expect an algorithm to predict human tutor feed- 
back better than it would be predicted by another tutor, so 
TC provides a high performance target. 
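To make the classification framing concrete, the sketch below computes precision and recall between two sets of edits. The edit tuples are invented for illustration and do not come from our dataset. The last lines also show why pooling both directions of the Tutor Classifier yields a single agreement value, since the precision of one direction is the recall of the other.

```python
def precision_recall(predicted, reference):
    """Precision and recall of a predicted set of edits against a reference set."""
    predicted, reference = set(predicted), set(reference)
    agreed = len(predicted & reference)
    precision = agreed / len(predicted) if predicted else 0.0
    recall = agreed / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical Phase II edits for one snapshot, written as (edit type, location).
tutor1 = {("delete", 4), ("move", 7), ("insert", "repeat@2")}
tutor2 = {("move", 7), ("insert", "repeat@2"), ("insert", "say@5"), ("delete", 9)}

p12, r12 = precision_recall(tutor1, tutor2)  # Tutor 1 predicting Tutor 2
p21, r21 = precision_recall(tutor2, tutor1)  # Tutor 2 predicting Tutor 1

# Pooling both directions: predicted and reference totals are the same number,
# so pooled precision and pooled recall coincide.
pooled_agreement = 2 * len(tutor1 & tutor2) / (len(tutor1) + len(tutor2))
```

The same function applies with SourceCheck's edits as `predicted` and a tutor's edits as `reference`.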


4.1 Results 

We first look at predicting the targeted feedback that tutors 
provided in Phase I. Figure 1 shows the percentage of the 
tutor edits that were also generated by SourceCheck and 
TC, or the recall of both predictors. We did not observe 
large differences between prediction success for edits gener- 
ated by the two human tutors, so we report their results in 
aggregate. While Deletions were fairly rare, SourceCheck 
performs quite poorly at predicting them on both assign- 
ments. However, SourceCheck predicts 46% and 47% of tu- 
tor Moves and Insertions respectively on GG, and 69% and 
80% of Moves and Insertions for SQ, where it even outperforms
TC. Totalling all edits, SourceCheck had a recall of
0.45 and 0.57 on GG and SQ respectively, while TC achieved 
0.59 and 0.65. 


An important limitation of recall is that it only considers 
how many of the tutor edits were successfully predicted, and 
not how many “guesses” (suggested edits) it took to do so.
To understand how much of SourceCheck’s feedback agrees 
with tutor feedback, we must compare it against all tutor 
edits collected in Phase II. Figure 2 shows the recall (top) for 
SourceCheck and TC over all Phase II edits, as well as the 
precision (bottom), or the percentage of SourceCheck and 
TC edits that agreed with human tutor edits. We see very 
similar trends for recall across Phases I and II, implying 
that both SourceCheck and TC predict “ideal” (Phase I) 
and “possible” (Phase II) edits at similar rates. Totalling all 
Phase II edits, SourceCheck had a recall of 0.41 and 0.41 on 
GG and SQ respectively, while TC achieved 0.57 and 0.54. 


However, SourceCheck’s precision is much lower, particu- 
larly for GG, where SourceCheck suggests more of every type 
of edit, for a total of over 50% more suggested edits. Source- 
Check generated on average 10.7 and 6.4 edits per snapshot 
on GG and SQ respectively, compared to 6.3 and 5.2 edits 
per snapshot for the tutors’ Phase II edits. Despite this low 
precision, SourceCheck is not simply suggesting edits every- 
where in the code and getting a few correct by chance. It 
correctly suggests no edit for 1092/1238 (88%) of GG AST 
nodes where the tutors also did not suggest an edit in Phase 
II and for 662/703 (94%) of SQ nodes. Totalling all edits, 
SourceCheck had a precision of 0.27 and 0.38 on GG and 
SQ respectively, while TC achieved 0.57 and 0.54².


4.2 A Closer Look 


We manually investigated edits on which the human tutors 
and SourceCheck disagreed, and in this section we present 
some common causes of disagreement: 


Variables: We noticed that many disagreements were over
variable assignments and references. For example, most of
the Phase I deletions that SourceCheck failed to predict were
instances of a tutor deleting a variable, such as when a student
used the wrong variable in an expression. This is largely
due to SourceCheck’s canonicalization process [6], which
currently gives all variables the same label, making them
indistinguishable. This simplification makes code matching
easier, but clearly a more robust solution is needed.

² Note that because with TC, Tutor 1 predicts Tutor 2 and
vice versa, the precision and recall of TC in Phase II will be
the same, and this value indicates percent agreement.

Figure 2: The recall for Phase II edits (top), as well as the percentage of SourceCheck (SC) and TC edits
that agreed with a human tutor (the precision, bottom).


Supporting Unusual Code: Many times, a tutor sug- 
gested an edit, such as deleting an unneeded control struc- 
ture, that would lead the student away from a potentially 
confusing program state. In many of these cases, Source- 
Check found some solution which used this unusual code 
correctly and instead suggested how the student could do 
the same. We view this behavior as a design choice, rather 
than a flaw per se, but it is worth investigating when this 
behavior would lead to its intended effect of supporting un- 
conventional solutions, and when it would lead to confusion. 


Code Variability: The assignments we analyzed were com- 
plex enough to allow the student to make a number of small 
design choices, such as how to reset the sprite and canvas before
drawing the “Squiral” in the SQ assignment. Often, the
tutor and the target solution chosen by SourceCheck made 
different, correct suggestions. This also occurred between 
human tutors, emphasizing that disagreement with the tu- 
tors does not always indicate poor feedback. 


Human Traits: Sometimes the human tutors were able 
to infer information from natural language in a student’s 
code that influenced their feedback in a way that would not 
be possible for SourceCheck. For example, the name of a 
variable might imply how it is intended to be used (e.g. 
“randomNumber”). This sometimes led to very different ed- 
its from SourceCheck and the human tutors. On the other 
hand, humans are also capable of making careless errors, 
and our tutors sometimes simply forgot to suggest a small, 
useful edit in Phase II, which SourceCheck remembered. 

5. DISCUSSION

RQ1: How well does SourceCheck’s feedback agree with ideal 
human tutor feedback? SourceCheck agrees with approxi- 
mately half of ideal tutor feedback provided in Phase I, al- 
most as much as another human tutor, with SourceCheck 
achieving a recall 76% and 88% as high as TC on GG and 
SQ respectively. This does not necessarily mean that Source- 
Check’s feedback is almost as good as a tutor’s. It is possible 
that when SourceCheck’s feedback diverges from a tutor’s, it 
does so in a less useful way than when another tutor does so; 
however, this is difficult to investigate without some direct 
measure of hint quality (e.g. [8]). For now, we can say that 
these results suggest good potential for data-driven feed- 
back generation, in that ideal tutor feedback is frequently 
contained in the set of edits generated by SourceCheck. 


RQ2: How does the agreement between SourceCheck and a 
human tutor compare to the agreement between human tu- 
tors? Our results for RQ2 are mixed. In Phase II, Source- 
Check was 72-76% as likely to agree with a given tutor’s 
edit as another tutor was on GG and SQ (as measured by 
recall). However, a given tutor was only 47-70% as likely to 
agree with SourceCheck’s edit as with another tutor’s edit 
(as measured by precision). This is largely because Source- 
Check generated more total edits than the tutors did, espe- 
cially on GG. This lack of precision seems to be the largest 
difference between SourceCheck and human tutors. Even if 
SourceCheck can produce quality feedback, the benefit to 
the student might be lost if it is hidden among less useful 
suggestions. Additionally, recent work suggests that stu- 
dents seek less help after receiving poor quality hints [8]. 
A critical direction for future research will be how to select 
feedback once a set of possible edits has been generated. 
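The relative figures quoted in this section follow directly from the totals reported in Section 4.1. The short check below reproduces them; only the dictionary layout and the helper name `relative` are introduced here.

```python
# Totals reported in Section 4.1, written as (GG, SQ).
phase1_recall = {"SC": (0.45, 0.57), "TC": (0.59, 0.65)}
phase2_recall = {"SC": (0.41, 0.41), "TC": (0.57, 0.54)}
phase2_precision = {"SC": (0.27, 0.38), "TC": (0.57, 0.54)}


def relative(scores):
    """SourceCheck's score as a rounded percentage of TC's score, per assignment."""
    return tuple(round(100 * sc / tc) for sc, tc in zip(scores["SC"], scores["TC"]))


rq1_recall = relative(phase1_recall)        # Phase I recall relative to TC
rq2_recall = relative(phase2_recall)        # Phase II recall relative to TC
rq2_precision = relative(phase2_precision)  # Phase II precision relative to TC
```

These ratios describe agreement relative to a second tutor and should not be read as measures of hint quality.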


It is also worth noting that our two human tutors had rel- 
atively low agreement. Comparing all suggested Phase II 
edits, we see that they have a 57% and 54% agreement on
the GG and SQ assignments respectively. In fact, tutors
only agreed completely on 8 out of 22 SQ snapshots (36%) 
and 7 out of 29 GG snapshots (24%) in Phase I. This sug- 
gests the assignments we studied truly are open-ended, since 
tutors often disagreed on the best path forward, though we 
cannot make any strong claims using our human data because
it was generated by the authors. This supports our
choice to measure agreement using the similarity of edits, 
rather than using a single, best “gold standard” hint, as was 
done by Piech et al. on simpler assignments [5]. 


6. CONCLUSION 


In this work, we have presented SourceCheck, an algo- 
rithm for automatically generating data-driven feedback for 
students working on open-ended programming problems. 
SourceCheck builds on existing methods [3, 9] to improve 
the processes of selecting a target solution from a set of 
correct solutions and inferring edits to get the student to 
that solution. It does so with a code-specific matching 
function and more semantically meaningful suggested edits: 
Moves and Reorders. We have also presented a method for 
evaluating automatically generated feedback by comparing 
it to feedback generated by human tutors playing the same 
role. We extend existing methods [5] by using a dataset 
of real student help requests to ensure authenticity and by 
formulating the problem as a prediction task, allowing us 
to compare the similarity among an algorithm and multiple 
human tutors. This allows us to envision the high standard 
of an algorithm as similar to human tutors as they are
to each other. We show that SourceCheck approaches this 
target in some ways and falls well short in others. 


Based on our results, iSnap has been updated to include
SourceCheck feedback, and we envision a number of practical
applications for the algorithm. In busy classrooms, large-scale
MOOCs and informal learning settings, instructors are
often absent or unavailable. The on-demand feedback pro- 
vided by SourceCheck can keep students going when they 
get stuck and would otherwise give up. SourceCheck could 
also be used to identify potentially struggling students in real
time, based on their distance to a known solution. Both
SourceCheck and our evaluation methodology were designed 
to scale to the larger, more complex programming problems 
found in real classrooms. This will require SourceCheck to 
support a greater diversity of student code, which will re- 
quire a larger dataset of correct solutions for matching. 


This work also has clear limitations. We only used two tutors 
to generate human feedback, and the authors who served as 
tutors were not pedagogical experts and had limited teach- 
ing experience. While their experience is on par with many 
graduate computing TAs, results may be different with expe- 
rienced teachers. Additionally, despite efforts at objectivity, 
the tutors’ familiarity with each other and with SourceCheck 
may have biased their feedback. Our work is also limited by 
the small sample of assignments and hint requests we inves- 
tigated, especially given that our results were quite different 
for GG and SQ. Finally, the methods presented here do not 
lend themselves to traditional statistical testing, making it 
difficult to make claims about true differences in recall and 
precision. Our methods only speak to the relative similarity 
of algorithmic and human tutor feedback, but this does not 
directly assess feedback quality. 

This work opens many avenues for future work. Our results
suggest a number of ways SourceCheck could be improved, 
such as a method for selecting which of the generated ed- 
its are most useful to show the student. Future work could 
also explore how to expand data-driven ITS feedback for
programming beyond edit-based hints, towards richer de- 
scriptions or explanations. Our results also raise questions 
about the consistency of human feedback on open-ended pro- 
gramming problems, and future work should determine how 
much agreement can be expected among human tutor feed- 
back. Lastly, the methods presented here can be used to 
evaluate, compare and benchmark other feedback genera- 
tion techniques, giving researchers a better understanding 
of their strengths and weaknesses. 

ABSTRACT


Understanding how individuals interact with a course after 
receiving a passing grade could have important implications for 
course design. If individuals become disengaged after passing a 
class, then this may raise questions about optimal ordering of 
content, course difficulty, and grade transparency. Using a person- 
fixed effects model, we analyze how individuals who obtained 
passing grades subsequently behaved within a course. These 
learners were less likely to complete videos and more likely to 
watch videos faster after receiving notice of a passing grade in the 
class. These learners were also less likely to reattempt items they 
initially got wrong. 


Keywords 


Video-interactions; grading schemes; learning analytics; MOOCs


1. INTRODUCTION 


Grades are a key component of online courses. However, there is 
a great deal of heterogeneity in the downstream effects of grading 
and grading schemes. For instance, female students who received 
an ‘A’ in their introductory economics courses were substantially 
more likely to major in the subject than individuals who received 
a B but had similar scores in the class [1]. On the other end of the 
spectrum, research suggests that pass-fail grading schemes may be 
beneficial in terms of student stress in high-stakes environments 
[2]. Other work suggests that the presence of pass-fail grading 
discourages student performance [3].


MOOCs offer a unique opportunity to understand how grading 
affects within-course behavior. First, clickstream data can 
document subtle changes in behavior that are reasonable proxies 
for engagement and effort (e.g. video consumption, video 
interactions, multiple attempts on items). Second, compared to 
traditional courses, grading in MOOCs is much more salient and 
immediate. Grades are recomputed instantaneously, and solutions 
are presented after every single problem. 


Understanding how individuals interact with a course after 
receiving a passing grade could have important implications for 
course design. If individuals disengage after passing a class, then 
it may make sense to structure a course such that final grades are 
not revealed until all problems have been attempted. 
Alternatively, if individuals exert more effort in a class after 
reaching passing status, then perhaps courses should be designed 
with gamification/scaffolding in mind such that a learner is 
continually working for a new certificate/badge. 


2. DATA 


The dataset used in this analysis comes from an introductory course in
Statistical Learning administered multiple times via Stanford’s
Lagunita Platform. In total, 55,000 individuals enrolled in the class. Of
that population, 11,301 individuals interacted with both course 
videos and with assignments related to the course at least once. Of 
these individuals, 2,485 achieved certification. 


The course includes 77 videos. The cumulative length of these 
videos is 15.3 hours. We used the clickstream created by learners 
who viewed the course via the Lagunita platform. Clickstream 
events are generated each time a video is loaded, finished, played 
or paused, fast forwarded or rewound. Other clickstream activities 
include changes to the media player’s playback speed to one of 
six settings (0.5X, 0.75X, 1.0X, 1.25X, 1.5X, and 2.0X). These
activities were aggregated on a user-video level. In total, there 
were 126,799 learner-video observations. 


2.1 Course Items 


The course assignments consisted of 103 multiple-choice, short-response,
and fill-in-the-blank items. Learners who answered at
least 50% of all items correctly received a certificate. Individuals 
who obtained a score of 90% or more received a certificate of 
distinction. We limited the dataset to include only individuals 
who attempted at least a simple majority of items. 


3. ANALYSIS 


In this course, learners are keenly aware of the grading cutpoints. 
The distribution of learners’ scores shows substantial jumps in
density at just above 50% and just above 90% (red lines), as seen 
in Figure 1. In an educational context, such jumps usually indicate 
a bias on the part of graders to give students with marginal scores 
the benefit of the doubt [4]. In this instance, though, all exams are 
graded electronically, and this type of manipulation by a grader is 
not possible. Instead, this heaping likely reflects a subset of 
learners who are extremely motivated by the certificate, and cease 
attempts after obtaining it. In this case, we identified that 
approximately 5% of students stopped attempting items shortly 
after they hit the 50% threshold. Formal evaluation via the
McCrary Density Test¹ rejects continuity of the density function
at the cutoff scores, with t-statistics of 7.1 and 5.1 at the 50%
and 90% thresholds respectively [5].

¹ The McCrary Density Test estimates the continuity of exam
scores at the cutoff using local linear regression. If the left- and
right-hand-side estimates differ substantially, it would suggest
manipulation or selection into one of the two groups.

Figure 1 Histogram of Course Scores (x-axis: Percent Score in the Course; y-axis: Density)

Given how pronounced and precise this heaping was, we
examined the grade-reporting interface. If a user clicks on a link
to their progress, a report is generated with a user’s score on each
exam, as well as their overall status with the course, indicating
whether they have currently passed the course. Figure 2 depicts a
mock-up of these reports. There are several noteworthy features of
this reporting format. First, these grading thresholds are very
clearly identified by their shading. Light grey depicts the region
that is considered passing (>50%) and dark grey depicts the
region that is considered passing with distinction (>90%). A
learner’s grade is communicated by their total score bar (right
most column). If this bar is at 50% or more, they will be able to
observe the top of the total score bar in the light grey region,
indicating that they passed. If the total score bar is in the dark
grey region, this indicates the learner has earned a certificate of
distinction. On top of these features, the total score is computed
and displayed in percentage terms, making the learner’s grade
relative to the passing threshold eminently clear. In this artificial
example, the learner obtained a 100% on every item but stopped
almost immediately after obtaining a passing grade in the course.

Figure 2 Example Grading Report

This reporting format could help explain the popularity of grade
checking behavior in the course. Ninety-eight percent of learners
checked their grades at least once, and the median user checked
their grade 32 times during the course of the class.


3.1 When Passing Occurs


There is considerable variation in when a learner passes a course.
Our identification strategy leverages within-learner variation
before and after they became aware they passed the course. Figure
3 shows that of the 2,485 learners who passed the course, the
median individual tends to do so within the first 70 items. This
leaves almost a third of the course and its items to serve as a
behavioral contrast. We also exploit variation of when students
become aware they have passed the course. Approximately 70%
of individuals checked their grade on the day that they passed the
course. Others realized this information at a later date.


Figure 3 Item on which an Individual Obtains a Certificate 
(CDF: Percentage of Certificate Earners) 


3.2 Impact of Passing on Engagement 

We estimate user engagement by analyzing video interactions 
before and after a learner receives notification that they have 
passed the course via person fixed-effect regression. The 
specification is below: 


$$\text{UserBehavior}_{ij} = \beta_1 \, \text{PassNotification}_{ij} + \Gamma_i + \epsilon_{ij}$$ 


Here $\Gamma_i$ denotes the person fixed effect, and the user 
behavior and pass notification refer to the ith person’s 
performance on their jth video. Outcomes include playback speed, 
fast forwarding, and video completion. For the purposes of this 
analysis, we define video completion as a student completing 90% 
of a video. This threshold was chosen because these videos often 
contain summaries, production details, and end titles in the last 
minute or so of content. 
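
The person fixed-effect specification above can be illustrated with a within-person (demeaning) estimator. This is a minimal sketch under stated assumptions, not the authors' estimation code: the function name, the tuple layout, and the toy data are hypothetical, and only a single regressor (the 0/1 pass-notification indicator) is handled.

```python
from collections import defaultdict


def within_estimator(rows):
    """Estimate beta in y_ij = beta * x_ij + Gamma_i + e_ij by demeaning within person.

    rows: iterable of (person_id, x, y), where x is the 0/1 pass-notification
    indicator and y is the behavior measured on one video (hypothetical layout).
    """
    by_person = defaultdict(list)
    for person, x, y in rows:
        by_person[person].append((x, y))

    num = 0.0
    den = 0.0
    for obs in by_person.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        # Subtracting person means removes the person fixed effect Gamma_i.
        for x, y in obs:
            num += (x - mean_x) * (y - mean_y)
            den += (x - mean_x) ** 2
    if den == 0:
        raise ValueError("no within-person variation in x")
    return num / den


# Hypothetical data: two learners with different baseline playback speeds.
rows = [
    ("a", 0, 1.00), ("a", 0, 1.00), ("a", 1, 1.25), ("a", 1, 1.25),
    ("b", 0, 1.50), ("b", 1, 1.75),
]
beta = within_estimator(rows)
```

In this toy data each learner speeds up by the same amount after notification, so the estimate is unaffected by the difference in their baseline speeds, which is the point of including the person fixed effect.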


Our first analysis suggests that individuals sped up after passing a 
course. The first column of Table 1 corresponds to a univariate 
regression model of playback speed on pass notification. The 
second column corresponds to a regression model of playback 
speed on pass notification and a person-fixed-effect. The third 
column also includes a time trend that accounts for how many 
days a student has been enrolled in a course at the time of their 
video interactions. After accounting for person fixed effects, our 
preferred regression model (Column 2) finds that individuals speed 
up by about 1% on average. Given that playback speed takes six 
discrete values (0.5X, 0.75X, 1.0X, 1.25X, 1.5X, 2.0X), this 
speed-up reflects a subset of learners adjusting their playback 
speed on a subset of the videos they interacted with rather than a 
gradual shift across all videos. Depending on how early a learner 
obtained a passing grade for the course, this speed-up represents 
as much as a 10-minute reduction in time spent watching videos 
over the remainder of the course. In terms of effect size, this 
increase corresponds to roughly .05 of a standard deviation. 
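
As a rough check on the "about 1%" figure, one can divide the Column 2 coefficient by the Column 2 constant. This assumes the constant (1.082) can be read as the average playback speed before notification, which is an interpretation and not something the paper states:

```latex
\frac{\hat{\beta}_1}{\text{constant}} = \frac{0.0107}{1.082} \approx 0.0099 \approx 1\%
```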
Table 1 Effect of Pass Notification on Playback Speed 


                (1)           (2)              (3) 
                Univariate    Person Effects   Time Trend 
Pass Notice     0.0184***     0.0107***        0.00607** 
                (3.68)        (5.17)           (3.04) 
Log Days                                       0.00412*** 
                                               (5.29) 
Constant        1.080***      1.082***         1.070*** 
                (298.23)      (2318.28)        (438.87) 
Observations    126799        126799           126799 
Adjusted R²     0.002         0.776            0.776 


t statistics in parentheses 

* p < 0.05, ** p < 0.01, *** p < 0.001 

Other video behaviors suggest that individuals may be less 
engaged in the course after receiving certification. Modeling the 
effect of receiving a passing grade on fast forwarding behavior 
suggests that passing notification is associated with a 4-5 
percentage point reduction in fast forwarding and a 3-4 
percentage point reduction in rewinding. A decrease in fast 
forwarding behavior may be seen as a form of increased 
engagement by some. However, it should be noted that fast 
forwarding and rewinding are symmetric actions (the 
concordance within video between rewinding and fast forwarding 
is 73%). 

When answering a question on an assignment, a very common 
learner strategy is to review prior material. If a learner is 
searching a video for a particular statement or graph, they are 
unlikely to skip to exactly the right point in time. Even if they 
did, learners may like to check the immediately preceding and 
following slides for context or clarifying information. In these 
cases, one would expect to see both fast forwarding and 
rewinding. Most of the reduction in rewinding and fast forwarding 
seems to come from cases like these. In terms of total effect size, 
these reductions correspond to a .10 reduction in fast forwarding 
and a .06 reduction in rewinding. 


Table 2 Effect of Pass Notification on Fast Forwarding (Top) 
and Rewinding (Bottom) 


Fast Forwarding 
                (1)           (2)              (3) 
                Univariate    Person Effects   Time Trend 
Pass Notice     -0.0430***    -0.0419***       -0.0496*** 
                (-7.59)       (-11.20)         (-12.03) 
Log Days                                       0.00682*** 
                                               (4.07) 
Constant        0.259***      0.259***         0.239*** 
                (64.27)       (307.34)         (48.20) 
Observations    126799        126799           126799 
Adjusted R²     0.002         0.180            0.181 

Rewinding 
                (1)           (2)              (3) 
                Univariate    Person Effects   Time Trend 
Pass Notice     -0.0258***    -0.0303***       -0.0418*** 
                (-4.08)       (-7.05)          (-9.10) 
Log Days                                       0.0102*** 
                                               (5.73) 
Constant        0.393***      0.394***         0.364*** 
                (78.88)       (407.29)         (68.39) 
Observations    126799        126799           126799 
Adjusted R²     0.000         0.200            0.201 


t statistics in parentheses 

* p < 0.05, ** p < 0.01, *** p < 0.001 


We also note that the percentage of videos completed decreases 
after pass notification: individuals are approximately five 
percentage points less likely to complete a video after passing 
the course. This corresponds to approximately .15 of a standard 
deviation. 


Table 3 Effect of Pass Notification on Video Completion 


                (1)           (2)              (3) 
                Univariate    Person Effects   Time Trend 
Pass Notice     -0.0226***    -0.0498***       -0.0425*** 
                (-5.12)       (-14.69)         (-11.99) 
Log Days                                       -0.00646*** 
                                               (-4.45) 
Constant        0.858***      0.864***         0.882*** 
                (303.85)      (1130.74)        (200.95) 
Observations    126799        126799           126799 
Adjusted R²     0.001         0.182            0.182 


t statistics in parentheses 
* p < 0.05, ** p < 0.01, *** p < 0.001 


Finally, we examine the number of attempts individuals made to 
answer items. We find that individuals who passed the course 
were subsequently less likely to make multiple attempts on 
incorrect items. Before passing, there were an average of 1.11 
attempts. After passing, this declined to 1.07 attempts. This 
corresponds to an effect size of approximately .07. 


Table 4 Effect of Pass Notification on Item Attempts 


                (1)           (2)              (3) 
                Univariate    Person Effects   Time Trend 
Pass Notice     -0.0361***    -0.0394***       -0.0316*** 
                (-7.60)       (-6.98)          (-4.73) 
Log Days                                       -6.702* 
                                               (-2.41) 
Constant        1.114***      1.114***         67.54* 
                (427.81)      (1888.66)        (2.45) 
Observations    113562        113562           113562 
Adjusted R²     0.001         0.203            0.203 


t statistics in parentheses 
* p < 0.05, ** p < 0.01, *** p < 0.001 


3.3 Limitations to Analysis 

This study was conducted on a single MOOC. It should also be 
noted that this MOOC was a terminal course: it was not part of a 
broader sequence, and its content was not necessary for other 
courses available within the platform. As such, our finding that 
users disengaged from course material after passing the course 
may not generalize. 


4. DISCUSSION 


On balance, our findings suggest that passing notification 
discourages subsequent engagement for at least a subset of users. 
We see increases in playback speed and less video completion. 

These findings are consistent with evidence from the educational 
psychology and behavioral economics literature, which has 
suggested that receipt of a certificate or badges can discourage 
intrinsic motivation in individuals [6][7]. Earlier work in MOOCs 
also found that individuals who obtain certificates in courses 
actually skipped nearly a quarter of a course’s video content [8]. 
We have documented several learner behaviors that are relevant to 
the design of MOOCs, and likely the design of online teaching 
more generally. 


Note to Table 4: Observations differ in this specification because 
it is based on person-item level data rather than person-video 
level data. 


With respect to grading schema, there is substantial evidence that 
individuals act in a more engaged manner before passing a course 
than after they have received a notification of passing. We also 
see this strategic behavior in that there are substantially more 
students just above the passing threshold than just below it. 


One policy implication of these findings concerns how quickly 
learners should be notified about their overall success in a course. 
Currently many courses notify learners instantaneously, daily, or 
on a near weekly basis when these events occur. For courses with 
a well-defined end date, it may make sense to not notify users of 
their final grades until the course is completed. 


A second consideration is how transparent instructors should be 
in terms of grading. Learners could not manipulate their grades as 
easily if they did not know the exact threshold for passing. Using 
language that describes approximate cutpoints may discourage 
learners from conflating certification and completion while 
allowing for more rigorous causal inference. 


Lastly, there is the question of course structure. If individuals 
put forth less effort after passing a class, then perhaps a more 
traditional instructional environment of weekly assignments with 
a summative final project or exam may yield more total learning. 


5. FUTURE STEPS 


We found that notification of a passing grade decreased 
subsequent effort in the same course. An equally intriguing 
question is how individuals who are enrolled in multiple classes 
behave after this notification. If these individuals are solely 
interested in accumulation of credentials or certificates, 
presumably we would see effort shift to courses where learners 
have yet to obtain certificates. 
ABSTRACT 


In this work, we describe a new statistical method to improve the 
detection of treatment effects in interventions. We call our method 
TAME (Trained Across Multiple Experiments). TAME takes 
advantage of multiple experiments with similar designs to create a 
single model. We use this model to predict the outcome of the 
dependent variable in unseen experiments. We use the predictive 
accuracy of the model on the conditions of the experiment to 
determine if the treatment had a statistically significant effect. We 
validated the effectiveness of our model using a large-scale 
simulation study, where we showed that our model can detect 
treatment effects with 10% more statistical power than an 
ANOVA in certain settings. We also applied our model to real 
data collected from the ASSISTments online learning platform 
and showed that the treatment effects detected by our model were 
comparable to the effects detected by the ANOVA. 


Keywords 


Intervention Effectiveness; Randomized Controlled Experiments; 
Meta-Analysis; ANOVA; Treatment Effect; TAME; 


1. INTRODUCTION 


The goal of this paper is to develop a method that can more 
effectively detect treatment effects in randomized controlled 
experiments that are run inside online tutoring systems. Common 
methods for analyzing these experiments include existing 
statistical tests such as a T-Test, regression, and an Analysis of 
Variance (ANOVA). Although these analysis methods are 
typically used, there are disadvantages that must be considered. 


Grossman et al. discuss several disadvantages of randomized 
controlled experiments [4]. One disadvantage is that the sample 
size is often small compared to the number of variables, so it is 
unlikely that the variables will be equally balanced across the 
control and treatment groups of the experiment. Another 
disadvantage is that a single study may not be able to infer the 
overall treatment effect on the entire population. The treatment 
may have different effects on different subpopulations, 
experimental settings may differ, and there may also be several 
different dependent measures to consider. There may also be a 
large number of experiments where the reported effects are false 
due to Type I error. 


We hope to ameliorate several of these issues by using a technique 
that combines data from several randomized controlled 
experiments in order to build a model to estimate the difference 
between conditions in experiments. Advantages of combining data 
from multiple experiments include increasing the sample size and 
reducing the variance for better confidence estimates [1]. 


Two major questions to consider when pooling experiments are 
discussed in [1]. The first question is, “Which experiments should 
be combined for analysis?”, and is considered “the most serious 
methodological limitation” [3]. Experiments should be combined 
if they have similar research questions, populations, experiment 
settings, intervention components, implementation, and dependent 
measures. In our paper we select experiments with the same 
dependent measures and study design format (A/B). 


The second question is how to combine experiments once they are 
chosen for inclusion. One method, called lumping, combines all 
the data into a single data set, ignoring the differences among the 
experiments. Another method, called pooling, combines 
experiments into a single data set but adjusts for differences in 
experiments [1]. In our case, we have experiments that can have 
very different effect sizes. We applied the pooling method, but 
instead of applying standard meta-analysis techniques, we trained 
a linear model to predict the outcome measures. 


Our goal is to use our method called TAME (Trained Across 
Multiple Experiments) to more effectively detect treatment effects. 
We use data from multiple experiments to increase the power of 
the model, and to utilize linear regression to model subject 
outcomes for treatment effect detection. We hope that TAME 
would also reduce the bias of meta-analyses in efforts to improve 
the reliability of statistical results. 


The data we use comes from a data set previously collected and 
synthesized from twenty-two randomized controlled experiments 
run inside the ASSISTments online tutoring system [5]. These 
experiments were proposed by internal and external researchers on 
a large variety of topics. The student population consists of mostly 
middle-school students ranging from grades 6-8. All experiments 
had a single control group and a single experiment group (A/B 
study design) with at least 50 students in each group. A total of 
102,252 problems were attempted by 8,297 students across 22 
different experiments. 


We conducted a large-scale simulation experiment to compare the 
accuracy of TAME to the accuracy of an ANOVA under different 
experiment settings. To determine how well each method 
performed, we looked at the chance of detecting an effect when 
there really is one (true positive) and the chance of not detecting 
an effect when there really is not one (true negative). These are 
the complements of the Type II and Type I error rates, 
respectively. Our research questions are: 1) Does TAME perform 
better than the ANOVA method? 2) Under what circumstances 
does TAME perform better? 
Table 1. Parameters, value ranges, and an example of a setting 


Parameter           Possible Parameter Values 
Expr. in a Group    2, 4, 6, 8, 10, 12, 14, 16, 18, 20 
Expr. with Diff.    [0, n], n = number of expr. in group 
Effect Size         0.05, 0.1, 0.15, 0.2, 0.4, 0.6, 0.8, 1 


2. TAME MODEL 


TAME borrows the idea of meta-analysis, where many 
experiments are used to report on generalized effects. The main 
concept of TAME is to first model the outcome measure in the 
absence of the condition assignment. Any other factors can still be 
used in the creation of the model. To do this, one must use data 
outside of the experiment of interest (the “test” experiment) to 
ensure that the model does not overfit to the test experiment. By 
training a model on a collection of similar experiments, it is less 
likely that the model will overfit to any given experiment. For the 
rest of this paper, we will refer to a group of similar experiments 
as an experiment group. 


For each experiment in an experiment group, we first train a linear 
model on all of the other experiments in the same group, using all 
factors in the data set except the condition assignments in the 
experiments. Note that the model used does not have to be a linear 
model and other types of models will work as well. Once a model 
is trained, it is applied to estimate the dependent measure of the 
test experiment. Then, we compute the residual value for each 
subject in the test experiment, which is the actual outcome 
measure minus the modeled outcome measure. Assuming that all 
other factors that may affect the outcome measures are accounted 
for in the model, the only cause of the residual values must be the 
condition assignments and noise. A two-tailed unpaired T-test is 
performed on the residual values of the samples from the control 
group and the treatment group in the test experiment to determine 
if there is a significant treatment effect. If the T-test reports that 
there are significant differences, we claim that the effect of the 
intervention was statistically significant. 
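
The procedure described above can be sketched as follows. This is an illustrative sketch only, with several loudly stated assumptions: the row layout (dicts with `factors`, `condition`, and `outcome` keys) and all function names are hypothetical, and a per-cell mean model stands in for the linear model (the text notes that other model types may be used). The sketch returns the unpaired t statistic and its degrees of freedom; turning these into a p-value requires the t distribution, which is left to a statistics package.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, variance


def fit_cell_means(train_rows):
    """Stand-in outcome model: mean outcome per combination of factor values.

    The condition assignment is deliberately not used when fitting.
    """
    cells = defaultdict(list)
    for row in train_rows:
        cells[row["factors"]].append(row["outcome"])
    grand_mean = mean(r["outcome"] for r in train_rows)
    cell_means = {k: mean(v) for k, v in cells.items()}
    # Fall back to the grand mean for factor combinations unseen in training.
    return lambda factors: cell_means.get(factors, grand_mean)


def unpaired_t(a, b):
    """Student's unpaired t statistic (pooled variance) and degrees of freedom."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


def tame_test(experiments, test_id):
    """Hold out one experiment, model it from the rest, t-test the signed residuals."""
    train = [r for eid, rows in experiments.items() if eid != test_id for r in rows]
    predict = fit_cell_means(train)
    residuals = {"A": [], "B": []}  # A = control, B = treatment
    for r in experiments[test_id]:
        residuals[r["condition"]].append(r["outcome"] - predict(r["factors"]))
    return unpaired_t(residuals["B"], residuals["A"])


# Hypothetical toy experiment group with one categorical factor.
experiments = {
    "e1": [
        {"factors": ("x",), "condition": "A", "outcome": 1.0},
        {"factors": ("x",), "condition": "B", "outcome": 1.0},
        {"factors": ("y",), "condition": "A", "outcome": 3.0},
        {"factors": ("y",), "condition": "B", "outcome": 3.0},
    ],
    "e2": [
        {"factors": ("x",), "condition": "A", "outcome": 1.1},
        {"factors": ("y",), "condition": "A", "outcome": 2.9},
        {"factors": ("x",), "condition": "A", "outcome": 0.9},
        {"factors": ("x",), "condition": "B", "outcome": 2.1},
        {"factors": ("y",), "condition": "B", "outcome": 3.9},
        {"factors": ("x",), "condition": "B", "outcome": 1.9},
    ],
}
t_stat, dof = tame_test(experiments, "e2")
```

Because the model never sees the condition assignments, a positive t statistic here reflects treatment-group students outperforming the model's prediction by more than control-group students do.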


The sign of the residual matters for our usage of the model, which 
is contrary to most modeling approaches, where the absolute or 
squared residuals are analyzed. If the residual is positive, it means 
that the student overperformed the model due to some factors that 
the model does not account for. Those factors positively affect the 
student outcome measure and could be attributed to helpful 
interventions. If the residual is negative, it means that the student 
underperformed the model, which may be caused by harmful 
interventions. We believe our method will result in a better 
estimate of treatment effects because training on all experiments 
except for one, without knowing the conditions of the 
experiment, will generate a less biased model than an ANOVA, 
which operates on a single experiment and includes the condition 
of the experiment while training the model. 


3. SIMULATION EXPERIMENT 


Simulated data are often used in the EDM community, as well as 
in other research areas, to validate models, as in [7]. One 
advantage of using simulated data is that the ground truth values 
are known, which makes it possible to compare the learned values 
to the true values. Another advantage of using simulated data is 
that it gives us the ability to control for and test any combination 
of parameters. To evaluate the effectiveness of our model, we ran 
a large-scale simulation experiment to compare the accuracy of 
treatment effects detected by TAME to the accuracy of treatment 
effects detected by an ANOVA. For both methods, we used a 
between-subject ANOVA (type III SS) to compare the main 
effects of the condition variable on our dependent measure using 
all other factors as fixed factors. We looked at the percent of 
treatment effects correctly detected (true positive, p<0.05) and 
incorrectly detected (false positive). Our simulation data was 
generated using Java code and the models were trained and 
evaluated using R. 


3.1. Data Generation 

The parameters we experimented with and their possible values 
are summarized in the first and second columns of Table 1, while 
the third column shows an instantiation of values for an example 
experiment setting. Ten trials of experimental data were generated 
for all combinations of parameters, resulting in over ten million 
trials generated. 


Experiments in a Group: This parameter represents the number of 
experiments in a group. We chose to sample groups in the range of 
[2, 20] experiments in increments of two because we believe this 
is a realistic number of experiments that could be analyzed 
together. Several recent meta-analysis papers publish data with the 
number of studies ranging from 12 to 217 [2, 5, 9]. It is also 
reasonable to have this many experiments with similar designs, 
which can be analyzed together. Our analysis of real data includes 
a dataset consisting of 22 experiments reported in [5]. 


Experiments with Differences: This parameter is the number of 
experiments where there is a difference in the outcome measure 
between the control and treatment group. This value ranges from 
having no experiments in a group with differences to having all 
the experiments within a group with differences. All experiments 
that have a difference between the control group and the 
treatment group have equal effect sizes. 


Samples: This parameter is the number of samples assigned 
to a given experiment. In the context of the EDM community, 
the number of samples is equivalent to the number of students 
that have participated in an experiment. We chose to simulate data 
for a number of students in the set {20, 40, 60, 80, 100, 200} 
because we believe these values are typical of the number of 
students expected to participate in most experiments. 


Factors: The number of factors for all experiments within an 
experiment group. The condition of the experiment is considered a 
special factor and is not grouped with the other factors. All factors 
are categorical variables. Factors are used to represent features of 
the student such as gender or levels of prior knowledge, which 
have been shown to improve predictive modeling [8]. We add 
features to the generated data to more accurately simulate a real- 
world scenario. We assume the features do not correlate with the 
intervention, and therefore do not have interaction effects. 


Values per Factor: This parameter represents the number of 
categorical values that each factor can take. For example, a 
factor with two values could represent the gender of a student, or 
a factor with several values could represent the prior knowledge 
of the student discretized into several bins. 


Effect Size: The effect size measured with Cohen’s d. Both 
smaller ranges of differences and larger ranges of differences were 
tested for both practical and theoretical contexts. In practice many 
experiments report small effect sizes; therefore we test in the 
range of [0.05, 0.2] in increments of 0.05 to simulate what would 
happen in a likely scenario. We also use values from [0.2, 1.0] in 
increments of 0.2 for larger differences to observe what would 
happen in a best-case scenario with a large difference in means. 


Table 2. A concrete example of simulated data 
(columns: Experiment, Sample, Condition, Factor 1, Base Outcome, 
Final Outcome) 


Table 2 shows an example of what the data generated under the 
example setting in Table 1 looks like. The first column in Table 2 
shows which experiment each sample belongs to. In this example 
there are only two experiments. Each experiment in this example 
has twenty samples; however, only two samples are shown for 
each experiment in Table 2. The sample column represents a 
unique sample number for each experiment. In the context of an 
experiment, the sample number represents the student. The 
condition column represents the condition to which the sample is 
assigned. The condition is chosen uniformly at random between 
“A” and “B”, where “A” represents the control group and 
“B” represents the treatment group. Each condition has a value 
associated with it, which is equivalent to the effect of the 
treatment. Table 2 shows that in this example, the intervention has 
an effect size of 1.0 standard deviation. Therefore the condition 
value is set to 1.0 where the condition is “B” (treatment), and the 
condition value is set to 0 where the condition is “A” (control). 


Each factor in the experiment has a column for the categorical 
value of that factor and a value for how that factor value affects 
the dependent measure of the experiment. Since there is only one 
factor in this experiment setting, there is only a single factor 
column (“Factor 1”) shown in Table 2. This column can hold three 
values (“A”, “B”, or “C”), because the number of values per factor 
is set to three in this experiment setting. Each factor value is 
generated randomly and uniformly for each sample. The value for 
how the factor affects the dependent measure is randomly 
generated from a standard normal distribution (μ = 0, σ = 1.0) with 
Gaussian noise added to the value for each sample for a more 
realistic simulation. The noise is generated from a normal 
distribution with the mean centered at the randomly generated 
value for the factor with a standard deviation of 0.25. In Table 2, 
this can be seen by looking at rows 1 and 4, which are assigned to 
factor “A”, where all the values for this factor are close to 0.4. In 
this example the randomly generated effect of factor “A” is 0.4 
with noise added for each sample. In the context of educational 
data mining, certain features of the student can have effects on 
learning gains which may vary slightly for each student. 


The base outcome value is a random number chosen from a 
normal distribution (μ = 0, σ = 1). This number represents how a 
random sample performs. The final column represents the 
dependent measure in experiments. This value is the sum of the 
base outcome value, all feature values, and the condition value. 
For example, row 2 has a condition value of 1, a factor value of 
0.1 and a base outcome value of 0.1. Therefore the final outcome 
value is 1 + 0.1 + 0.1 = 1.2. This representation may be thought of 
as the average learning gains a student has when comparing their 
pretest score to their posttest scores. We do not have an explicit 
dependent measure and will refer to it in the general context. 
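
The generation recipe described in this section can be sketched in a few lines. The paper's generator was written in Java; the Python below is only an illustrative stand-in, and the function name, argument names, and row layout are assumptions made for the example.

```python
import random


def generate_experiment(n_samples, n_factors, values_per_factor, effect_size, rng):
    """Generate one simulated A/B experiment using the additive recipe in the text."""
    levels = [chr(ord("A") + k) for k in range(values_per_factor)]
    # One effect per (factor, level), drawn from a standard normal distribution.
    level_effect = [{lv: rng.gauss(0.0, 1.0) for lv in levels}
                    for _ in range(n_factors)]
    rows = []
    for sample in range(1, n_samples + 1):
        condition = rng.choice(["A", "B"])  # A = control, B = treatment
        condition_value = effect_size if condition == "B" else 0.0
        factor_levels = [rng.choice(levels) for _ in range(n_factors)]
        # Per-sample factor value: noise centred on the level's effect, SD 0.25.
        factor_values = [rng.gauss(level_effect[f][lv], 0.25)
                         for f, lv in enumerate(factor_levels)]
        base = rng.gauss(0.0, 1.0)  # base outcome
        rows.append({
            "sample": sample,
            "condition": condition,
            "condition_value": condition_value,
            "factor_levels": factor_levels,
            "factor_values": factor_values,
            "base": base,
            # Final outcome = base outcome + all factor values + condition value.
            "final": base + sum(factor_values) + condition_value,
        })
    return rows


rng = random.Random(0)  # fixed seed so the sketch is reproducible
rows = generate_experiment(20, 1, 3, 1.0, rng)
```

The call at the end mirrors the example setting from the text: twenty samples, one factor with three values, and an effect size of 1.0.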


4. SIMULATION RESULT 


To analyze our results we calculated the mean true positive rate 
and false positive rate at the experiment group level. Each 
experiment group consisted of a varying number of experiments, 
with ten trials each. Each trial had a ground truth value where 
there was either a difference in conditions or there was not a 
difference in conditions. The ground truth value on whether or not 
an experiment had differences in conditions is represented in the 
“experiments with differences” variable described in section 3.1. 
If a model correctly detected significant differences (p<0.05) 
between conditions it was counted as a true positive. Similarly, if a 
model incorrectly detected significant differences it was counted 
as a false positive. An average of the true positive counts and false 
positive counts for all experiments and trials was used to equally 
weight each experiment group. Some random data samples 
generated errors in analysis. If an error occurred for any trial the 
entire experiment group was removed from analysis to ensure the 
analysis would be as unbiased as possible. There were 79,200 
simulated experiment groups, of which 58 were removed, 
resulting in 77,842 experiment groups analyzed. The data from the 
results of the simulation experiment and the code used can be 
found at https://sites.google.com/site/tamemethod/ 
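
The counting rule for true and false positives can be sketched as below. The function name and the `(has_true_difference, p_value)` input layout are hypothetical, and the p-values in the example are made up purely for illustration.

```python
def detection_rates(trials, alpha=0.05):
    """Return (true positive rate, false positive rate).

    trials: iterable of (has_true_difference, p_value) pairs, one per trial.
    """
    tp = fp = n_diff = n_null = 0
    for has_diff, p in trials:
        detected = p < alpha
        if has_diff:
            n_diff += 1
            tp += detected  # detected a difference that really exists
        else:
            n_null += 1
            fp += detected  # detected a difference that does not exist
    tpr = tp / n_diff if n_diff else float("nan")
    fpr = fp / n_null if n_null else float("nan")
    return tpr, fpr


# Made-up p-values for three trials with a real difference and two without.
trials = [(True, 0.01), (True, 0.20), (True, 0.04), (False, 0.03), (False, 0.60)]
tpr, fpr = detection_rates(trials)
```

The true positive rate computed this way is the statistical power, and the false positive rate is the Type I error rate.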


Since there was little change in the false positive rate (Type I 
error) regardless of method or factors, we exclude it from further 
analysis. All sets of parameters had a Type I error of roughly 5%, 
which is the threshold we used to determine if a model detected 
significant differences. Our analysis focuses on the true positive 
rates (statistical power) of each method. We ran a repeated 
measures ANOVA to compare the main effects of the parameters 
(see Section 3.1) on the statistical power of our method to the 
statistical power of an ANOVA. Out of 70,742 simulated 
experiments, TAME has an average power of 0.376 (SD = 0.357), 
which is slightly better than the ANOVA, which had an average 
power of 0.366 (SD = 0.353). This power may seem low; however, 
many experiments in the learning science community do in fact 
have low power due to the combination of low sample sizes and 
low effect sizes. 


Table 4 shows the results of a repeated measures ANOVA, which 
determined that the average power of TAME was significantly 
better than the ANOVA (F(1, 70,713) = 804.144, p < 0.001). In 
the following sections, we discuss the overall effect each 
parameter has on both methods and compare the effects between 
the two methods. 


4.1. Experiments in a Group 

There is no general effect of the number of experiments in a 
group. This is because this variable only matters for our 
method, which takes advantage of a larger number of experiments 
in a group when training a model. An ANOVA trains and tests on 
experiments individually; therefore the number of experiments in 
a group has no effect on the power of the ANOVA, which makes 
an overall effect across both TAME and the ANOVA less likely. 
Table 3. Tests of Between-Subject Effects 


Source              Sum of Squares   Mean Square   F           Sig.      Partial Eta Squared 
number of factors   37.834           9.459         333.678     < 0.001   0.019 
effect size                          1964.783      69313.168   < 0.001   0.873 

Table 4. Test of Within-Subjects Effects 


Source                                Type III SS   df      Mean Square   F         Sig.      Partial Eta Squared 
method                                0.576         1       0.576         804.144   < 0.001   0.011 
method * effect size                  2.948         7       0.421         587.633   < 0.001   0.055 
method * number of factors            2.205         4       0.551         769.046   < 0.001   0.042 
method * samples                      1.173         5       0.235         327.340   < 0.001   0.023 
method * experiments                  0.050         9       0.006         7.709     < 0.001   0.001 
method * values per factor            0.671         2       0.335         467.906   < 0.001   0.013 
method * percent of exp. with diff.   0.002         1       0.002         2.468     0.116 
error(method)                         50.683        70713 


There is also no practically meaningful difference between TAME 
and an ANOVA across different numbers of experiments in a 
group. Table 4 shows that the interaction between method and the 
number of experiments in a group is statistically significant 
(F(9,70713) = 7.71, p<0.001), but with a partial eta squared of 
only 0.001. Although the difference between the two methods is 
statistically significant, the effect size is negligible. 
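
Partial eta squared is conventionally computed as the effect sum of squares divided by the sum of the effect and error sums of squares. The sketch below applies that formula to entries from Table 4, assuming that the 0.050 entry is the sum of squares for the interaction between method and number of experiments and that 50.683 is the error(method) sum of squares; both readings are inferred from the table layout and are not stated explicitly in the text.

```python
def partial_eta_squared(ss_effect, ss_error):
    """Partial eta squared = SS_effect / (SS_effect + SS_error)."""
    return ss_effect / (ss_effect + ss_error)


SS_ERROR_METHOD = 50.683  # error(method) sum of squares as printed in Table 4

# Assumed to be the method * experiments interaction row.
eta_experiments = partial_eta_squared(0.050, SS_ERROR_METHOD)
# Main effect of method.
eta_method = partial_eta_squared(0.576, SS_ERROR_METHOD)
```

Rounded to three decimals, these values agree with the 0.001 and 0.011 reported, which illustrates how an effect can be statistically significant with this many observations while explaining almost none of the variance.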


Although there is no overall difference in method type for varying 
the number of experiments in a group, the number of experiments 
has a major impact in the case where there are a large number of 
factors and a small number of samples with a high effect size. 
Figure 1 shows that for a subset of experiments, as the number of 
experiments in a group increases, the difference in power between 
the two methods increases. TAME has a power of 0.27 compared 
to a power of 0.22 for the ANOVA with two experiments in a 
group and TAME has a power of 0.35 compared to a power of 
0.25 for the ANOVA with ten experiments in a group. 


4.2. Number of Factors 

More factors introduce more noise in the data, making it harder to 
detect treatment effects. Table 3 shows that the number of factors 
has a significant effect on power (F(4,70713) = 333.67, p<0.001) 
with a partial eta squared = .019. Figure 2 shows that as the 
number of factors increases, the power of TAME decreases less 
than the power of the ANOVA, so the gap in power between the 
two methods widens with the number of factors. This interaction 
between method and number of factors is statistically significant 
(F(4,70713) = 769.046, p<0.001) with a partial eta squared of 
0.042. We believe this is because TAME accounts for noise better 
than the ANOVA by using the additional data available to it. 


Figure 1. The power as the number of experiments in a group 
increases for experiment groups with 20 samples, four factors, 
and a treatment effect size of 0.8 and 1.0. (x-axis: Number of 
Experiments in a Group; series: TAME, ANOVA) 


4.3. Number of Samples 

In general, more samples lead to a better estimate of the true 
means and more power. Table 3 shows that the number of samples 
has a significant effect on power (F(5,70713) = 14204.334, 
p<0.001) with a partial eta squared = 0.5. As the number of 
samples increases, both methods perform equally well. This result 
is expected. 


Table 4 shows that TAME performs slightly better than the 
ANOVA when there are fewer samples, since the ANOVA is not 
an optimal method in this situation. The number of samples is a 
statistically significant factor when comparing the power 
differences between the two methods (F(5,70713) = 327.340, 
p<0.001) with a partial eta squared of 0.023. 


Figure 2. The statistical power of TAME and ANOVA by the 
number of factors used to train the models (x-axis: Number of 
Factors). 


4.4. Effect Size 


A larger treatment effect is easier to detect and therefore has a 
positive impact on power. Table 3 shows that the size of the 
treatment effect has a significant effect on power (F(7,70713) = 
69313.168, p<0.001) with a partial eta squared = 0.873. As the 
size of the effect increases so does the power. 


Table 4 shows TAME performs slightly better than the regular 
ANOVA as the treatment effect increases. The effect size is a 
statistically significant factor when comparing power differences 
between TAME and ANOVA, (F(7,70713) = 587.633, p<0.001) 
with a partial eta squared of 0.055. 


5. REAL DATA RESULT 

We applied both TAME and the ANOVA method to a data set 
composed of twenty-two randomized controlled experiments run 
inside the ASSISTments online learning platform to compare the 
two methods on real data [6]. Every experiment in the group is a 
Skill Builder consisting of one control group and one treatment 
group. A Skill Builder is “an assignment type that consists of a 
large number of similar problems, where students must answer a 
specified number of problems (usually three) correctly in a row on 
the same day in order to finish the assignment” [6]. We applied 
both TAME and an ANOVA to the students in the studies, with the 
following factors as training factors: Prior Percent Correct, 
Guessed Gender, Prior Percent Completion, Z Scored Mastery 
Speed, Prior Homework Percent Completion, and Z Scored HW 
Mastery Speed. For the dependent measure, we use the base-ten 
logarithm of the Mastery Speed, which is the number of problems a 
student took to answer three problems correctly in a row [9]. We 
use the logarithm of Mastery Speed to reduce the effect of outliers. 
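
The dependent measure can be illustrated with a minimal sketch. It assumes a student's attempts are available as an ordered list of correct/incorrect flags; the function names are hypothetical, and the same-day requirement and the handling of students who never reach mastery are not modeled here.

```python
import math
from typing import Optional, Sequence


def mastery_speed(correct: Sequence[bool], streak: int = 3) -> Optional[int]:
    """Problems attempted up to and including the one that completes
    `streak` consecutive correct answers; None if that never happens."""
    run = 0
    for attempted, is_correct in enumerate(correct, start=1):
        run = run + 1 if is_correct else 0
        if run == streak:
            return attempted
    return None


def log_mastery_speed(correct: Sequence[bool], streak: int = 3) -> Optional[float]:
    """Base-ten logarithm of the mastery speed (None if mastery is not reached)."""
    speed = mastery_speed(correct, streak)
    return None if speed is None else math.log10(speed)


attempts = [False, True, True, False, True, True, True]
print(mastery_speed(attempts), log_mastery_speed(attempts))
```

Because the logarithm compresses large values, a student who needs 30 problems differs from one who needs 3 by 1.0 on this scale rather than by 27 problems.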


Table 6 shows that our method can be applied to detect significant 
differences between conditions in a real data set. Since the size of 
each experiment in the data set is greater than 100, the results of 
the simulation study suggest that TAME is as good at detecting 
significant differences as ANOVA. Both TAME and ANOVA 
detected significant differences between conditions of the same 
experiments (2, 3, 4, 10, and 22). This result further supports our 
claim that TAME is a good alternative to ANOVA, if not better. 


We further investigated the reliability of TAME and ANOVA. For 
each experiment, we trained a model using all of the data from the 
other twenty-one experiments. We then used this model to predict 
the performance on the data in the test experiment. We 
experimented with different sample sizes (10, 20, 30, 40, 50, 
60, 70, 80, 90, and 100) to predict in the test experiment. The 
evaluation of each model was an average over 1,000 runs of the 
model, with a different random set of data points from the test 
experiment each time. This methodology does not invalidate our 
analysis since TAME was designed to utilize all data from outside 
of the target experiment, such as data from experiments in the 
past, and such data are not affected by the sample size of the target 
experiment. We chose to report on the results of two of the 
experiments in Table 5 and Table 7: experiment 3, which was the 
experiment for which we found the strongest treatment effect, and 
experiment 6, which was one of the experiments for which we did 
not find a significant treatment effect. 


Table 5. The probability and the confidence interval of 
detecting the treatment effect on the resampled data set 
(p < 0.05) on experiment 3 


Experiment | Probability of Detecting |        | Size of Adjusted Wald | 
60         | 0.8580                   | 0.8530 | 0.0217                | 0.0220 
80         | 0.9420                   | 0.9580 | 0.0147                | 0.0127 
90         | 0.9660                   | 0.9610 | 0.0115                | 0.0122 

Table 6. Summary statistics and significance for the real dataset 


Experiment | Mastery Speed, Control and Experiment Group | Mastery Speed, Control Group | Mastery Speed, Experiment Group | TAME Sig | ANOVA Sig | ANOVA Partial Eta Squared 
6 | μ=0.63, n=337, σ=0.17 | μ=0.62, n=165, σ=0.18 | μ=0.63, n=172, σ=0.16 | 0.634 | 0.737 | 0.000 
8 | μ=0.59, n=455, σ=0.17 | μ=0.59, n=223, σ=0.18 | μ=0.59, n=232, σ=0.16 | 0.542 | 0.571 | 0.001 
9 | μ=0.91, n=119, σ=0.16 | μ=0.93, n=52, σ=0.18 | μ=0.90, n=67, σ=0.14 | 0.460 | 0.478 | 0.005 
Table 7. The probability and the confidence interval of 
detecting the treatment effect on the resampled data set 
(p < 0.05) on experiment 6 


Experiment | Probability of Detecting |        | Size of Adjusted Wald | 
60         | 0.0720                   | 0.0780 | 0.0162                | 0.0167 
80         | 0.0490                   | 0.0530 | 0.0136                | 0.0141 
90         | 0.0620                   | 0.0670 | 0.0151                | 0.0156 


Table 5 shows that for the experiment with the strongest treatment 
effect (experiment 3), TAME is able to detect the treatment effect 
better than ANOVA, especially when the sample size is 40 or 
less. This result agrees with the result of our simulation study. 
When the treatment effect is not present (experiment 6), the false 
positive rates of both TAME and ANOVA are around 5%, as shown 
in Table 7. This result is to be expected from using a p-value 
threshold of 0.05. 
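
The interval sizes in Tables 5 and 7 can be illustrated with a short sketch. It assumes, since the text does not say so explicitly, that the "size of adjusted Wald" is the half-width of a 95% adjusted Wald (Agresti-Coull) interval for a detection proportion estimated from the 1,000 resampling runs; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def adjusted_wald_half_width(successes: int, trials: int, level: float = 0.95) -> float:
    """Half-width of the adjusted Wald (Agresti-Coull) interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for a 95% interval
    n_adj = trials + z * z                        # adjusted number of trials
    p_adj = (successes + z * z / 2.0) / n_adj     # adjusted proportion
    return z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)


# 858 detections in 1,000 runs corresponds to a detection probability of 0.8580.
print(round(adjusted_wald_half_width(858, 1000), 4))
# 72 detections in 1,000 runs corresponds to a detection probability of 0.0720.
print(round(adjusted_wald_half_width(72, 1000), 4))
```

Under this assumption the two calls give approximately 0.0217 and 0.0162, in line with the sizes reported next to the probabilities 0.8580 and 0.0720.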


6. CONTRIBUTIONS 


This paper makes three contributions. The first contribution of this 
paper is TAME, a more robust and more effective method of 
detecting treatment effects that can analyze several experiments 
simultaneously. Since the TAME model is not built specifically 
for any particular experiment, it allows the same model to 
generalize to experiments unseen by the model, including future 
experiments. To our knowledge, this is the first method that 
detects treatment effects on multiple experiments individually and 
simultaneously. 


The second contribution this paper makes is that the results from a 
large-scale simulation experiment showed that TAME is better at 
detecting treatment effects compared to an ANOVA by more than 
ten percent in the case where there is a large effect, fewer samples, 
more factors, and with more experiments. This simulation 
experiment validated our proposed method and also showed that 
TAME has slightly better statistical power than an ANOVA and 
never performs worse. TAME can quickly detect large differences, 
such as when the treatment is harmful. It is important to detect 
harmful interventions as soon as possible to ensure that students 
are exposed to the least amount of negative effects. 


The third contribution this paper makes is taking our validated 
method and applying it to real data collected from twenty-two 
randomized controlled experiments run in the ASSISTments 
online learning platform. On this data set, TAME and ANOVA are 
in agreement on significant differences between conditions. This 
result allows the associated researchers to further investigate the 
interventions and their effects, allowing them to better understand 
how students learn and, eventually, develop better tools and 
interventions for students. 


6.1. Future Work and Conclusions 

This work is a first step in building a model that can be used 
across interventions to estimate effect sizes. As such, there are 
many future directions to explore. One possible direction involves 
equally weighting the experiments our model uses. It is rare for all 
experiments to have the same number of samples. Currently 
our model gives more weight to experiments with more samples. 
This may lead to a small number of experiments accounting for a 
large amount of the weight when training a model. In future 
work, the weighting of experiments and its effect can be 
investigated. 


In conclusion, we have created a single model that generalizes 
across experiments. We have shown how it can be applied to 
multiple, unseen, experiments in order to evaluate their efficacy. 
This approach is in contrast to creating separate models for each 
intervention we are evaluating. This model is able to detect the 
effect of each intervention relative to other interventions and 
provide a set of features that might affect and interact with 
interventions. In addition, the same trained model can be applied 
to investigate future interventions. We evaluated the effectiveness 
of our model in a simulation study, which shows that our model 
can detect significant differences 10% more than an ANOVA in 
certain cases. We then applied our model to real data and found 
that three out of twenty-two interventions are significantly 
different from the control conditions. 
ABSTRACT 


An important, yet largely unstudied problem in student data 
analysis is to detect misconceptions from students’ responses 
to open-response questions. Misconception detection enables 
instructors to deliver more targeted feedback on the mis- 
conceptions exhibited by many students in their class, thus 
improving the quality of instruction. In this paper, we pro- 
pose a new natural language processing-based framework 
to detect the common misconceptions among students’ tex- 
tual responses to short-answer questions. We propose a 
probabilistic model for students’ textual responses involving 
misconceptions and experimentally validate it on a real-world 
student-response dataset. Experimental results show that our 
proposed framework excels at classifying whether a response 
exhibits one or more misconceptions. More importantly, it 
can also automatically detect the common misconceptions 
exhibited across responses from multiple students to multiple 
questions; this property is especially important at large scale, 
since instructors will no longer need to manually specify all 
possible misconceptions that students might exhibit. 


Keywords 
Learning analytics, Markov chain Monte Carlo, misconcep- 
tion detection, natural language processing 


1. INTRODUCTION 


The rapid developments of large-scale learning platforms 
(e.g., MOOCs (edx.org, coursera.org) and OpenStax Tutor 
(openstaxtutor.org)) have enabled not only access to high- 
quality learning resources to a large number of students, but 
also the collection of student data at very large scale. The 
scale of this data presents a great opportunity to revolu- 
tionize education by using machine learning algorithms to 
automatically deliver personalized analytics and feedback to 
students and instructors in order to improve the quality of 
teaching and learning. 


1.1 Detecting misconceptions from data 

The predominant form of student data, their responses to as- 
sessment questions, contains rich information on their knowl- 
edge. Analyzing why a student answers a question incorrectly 
is of crucial importance to deliver timely and effective feed- 
back. Among the possible causes for a student to answer a 
question incorrectly, exhibiting one or more misconceptions 
is critical, since upon detection of a misconception, it is 
very important to provide targeted feedback to a student 
to correct their misconception in a timely manner. Examples 
of using misconceptions to improve teaching include 
incorporating misconceptions to design better distractors for 
multiple-choice questions [10], implementing a dialogue-based 
tutor to detect misconceptions and provide corresponding 
feedback to help students self-practice, preparing prospective 
instructors by examining the causes of common misconcep- 
tions among students [19], and incorporating misconceptions 
into item response theory (IRT) for learning analytics [18]. 


The conventional way of leveraging misconceptions is to rely 
on a set of pre-defined misconceptions provided by domain 
experts [10, 19]. However, this approach is not scalable, since 
it requires a large amount of human effort and is domain- 
specific. With the large scale of student data at our disposal, 
a more scalable approach is to automatically detect miscon- 
ceptions from data. 


Recently, researchers have developed approaches for data-driven 
misconception detection; most of these approaches analyze 
students’ responses to multiple-choice questions. Examples 
of these approaches include detecting misconceptions 
in multiple-choice mathematics questions and modeling stu- 
dents’ progress in correcting them [9] via the additive fac- 
tor model [3], and clustering students’ responses across a 
number of multiple-choice physics questions [20]. However, 
multiple-choice questions have been shown to be inferior to 
open-response questions in terms of pedagogical value [8]. 
Indeed, students’ responses to open-response questions can 
offer deeper insights into their knowledge state. 


To date, detecting misconceptions from students’ responses to 
open-response questions has largely remained an unexplored 
problem. A few recent developments work exclusively with 
structured responses, e.g., sketches [17], short mathematical 
expressions [11], group discussions in a chemistry class [16], 
and algebra with simple syntax [4]. 


1.2 Contributions 


In this paper, we propose a natural language processing 
framework that detects students’ common misconceptions 
from their textual responses to open-response, short-answer 
questions. This problem is very difficult, since the responses 
are, in general, unstructured. 


Our proposed framework consists of the following steps. First, 
we transform students’ textual responses to a number of 
short-answer questions into low-dimensional textual feature 
vectors using several well-known word-vector embeddings. 
These tools include the popular Word2Vec embedding [12], 
the GLOVE embedding [15], and an embedding based on 
the long short-term memory (LSTM) neural network [6]. We 
then propose a new statistical model that jointly models 
both the transformed response textual feature vectors and 
expert labels on whether a response exhibits one or more 
misconceptions; these labels identify only whether or not a 
response exhibits one or more misconceptions but not which 
misconception it exhibits. 


Our model uses a series of latent variables: the feature vectors 
corresponding to the correct response to each question, the 
feature vectors corresponding to each misconception, the 
tendency of each student to exhibit each misconception, and 
the confusion level of each question on each misconception. 
We develop a Markov chain Monte Carlo (MCMC) algorithm 
for parameter inference under the proposed statistical model. 
We experimentally validate the proposed framework on a 
real-world educational dataset collected from high school 
classes on AP biology. 


Our experimental results show that the proposed frame- 
work excels at classifying whether a response exhibits one 
or more misconceptions compared to standard classification 
algorithms and significantly outperforms a baseline random 
forest classifier. We also compare the prediction performance 
across all three embeddings. More importantly, we show ex- 
amples of common misconceptions detected from our dataset 
and discuss how this information can be used to deliver tar- 
geted feedback to help students correct their misconceptions. 


2. DATASET AND PRE-PROCESSING 


In this section, we first detail our short-answer response 
dataset, and then detail our pre-processing approach to con- 
vert responses into vectors using word-to-vector embeddings. 


2.1 Dataset 


Our dataset consists of students’ textual responses to 
short-answer questions in high school classes on AP Biology 
administered on OpenStax Tutor [14]. Every response was labeled 
by an expert grader as to whether it exhibited one or more 
misconceptions. A total of N = 386 students each responded 
to a subset of a total of Q = 1668 questions; each response 
was manually labeled by one or multiple expert graders, 
resulting in a total of approximately 60,000 labeled responses. 
Since there is no clear rubric defining what is a misconception, 
graders might not necessarily agree on what label to assign to 
each response. Therefore, we trim the dataset to keep only 
responses that were labeled by multiple graders who all assigned 
the same label, resulting in 13,099 responses. We further trim 
the dataset by filtering out students who responded to fewer 
than 5 questions and questions with fewer than 5 responses in 
every dataset. This subset contains 6,152 responses. 


The questions in our dataset are drawn from the OpenStax 
AP biology textbook; we divide the full dataset into smaller 
subsets corresponding to each of the first four units [13], 
since different units correspond to entirely different sub-areas 
in biology. These units cover the following topics: Unit 1 
(The Chemistry of Life, Chapters 1-3), Unit 2 (The Cell, 
Chapters 4-10), Unit 3 (Genetics, Chapters 11-17), and Unit 4 
(Evolutionary Processes, Chapters 18-20). To summarize, 
we show the dimensions of the subsets of the data corresponding 
to each unit in Table 1. Since not every student was 
assigned to every question, the dataset is sparsely populated; 
Table 1 also shows the portion of responses that are observed 
in the trimmed data subsets, denoted as “sparsity”. 


       | N   | Q   | Sparsity (%) 
Unit 1 | 47  | 77  | 0.280 
Unit 2 | 101 | 104 | 0.243 
Unit 3 | 73  | 91  | 0.236 
Unit 4 | 43  | 75  | 0.315 


Table 1: Dataset statistics. 


2.2 Response embeddings 

We first perform a pre-processing step by transforming each 
textual student response into a corresponding real-valued 
vector via three different word-vector embeddings. Our first 
embedding uses the Word2Vec embedding [12] trained on the 
OpenStax Biology textbook (an approach also mentioned 
in [2]), to learn embeddings that put more emphasis on the 
technical vocabulary specific to each subject. We create 
the feature vector for each response by mapping each in- 
dividual word in the response to its corresponding feature 
vector, and then adding them together. Concretely, denote 
$x_{i,j} = \{w_1, w_2, \ldots, w_{T_{i,j}}\}$ as the collection of words in the 
textual response of student $j$ to question $i$, where $T_{i,j}$ 
denotes the total number of words in this response (excluding 
common stopwords). We then map each word $w_t$ to its 
corresponding $D$-dimensional feature vector $r(w_t) \in \mathbb{R}^D$ using the 
trained Word2Vec model. We use $D = 10$ for the Word2Vec 
embedding. We then compute the student response feature 
vector as $f_{i,j} = \sum_{t=1}^{T_{i,j}} r(w_t)$. 
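
The summation above can be sketched in a few lines. The toy two-dimensional embedding table, the stopword list, and the choice to skip out-of-vocabulary words are assumptions made for illustration; they stand in for a trained Word2Vec model and are not the embeddings used in this work.

```python
from typing import Dict, List, Sequence

# Hypothetical stopword list; the list actually used is not specified.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "and"}


def response_vector(text: str, embedding: Dict[str, Sequence[float]], dim: int) -> List[float]:
    """Sum the word vectors of all non-stopword tokens in a response."""
    total = [0.0] * dim
    for word in text.lower().split():
        if word in STOPWORDS or word not in embedding:
            continue  # assumption: out-of-vocabulary words are skipped
        for d, value in enumerate(embedding[word]):
            total[d] += value
    return total


# Toy embedding with D = 2 (illustrative values only).
toy = {"inbreeding": [1.0, 0.0], "causes": [0.0, 1.0], "mutations": [0.5, 0.5]}
f = response_vector("Inbreeding causes mutations", toy, dim=2)
g = response_vector("Mutations causes inbreeding", toy, dim=2)
print(f, g)
```

Because addition is commutative, the two toy responses contain the same words in a different order and therefore map to the same feature vector.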


Our second word-vector embedding is a pre-trained GLOVE 
embedding with D = 25 [15]. The GLOVE embedding is 
very similar to the Word2Vec embedding, with the main 
difference being that it takes corpus-level word co-occurrence 
statistics into account. Moreover, the quality of the GLOVE 
embedding for common words is likely higher since it is 
pre-trained on a huge corpus (compared to only the OpenStax 
Biology textbook for Word2Vec). 


Neither the Word2Vec embedding nor the GLOVE embedding 
takes word ordering into account, and for misconception 
classification, this drawback can lead to problems. For 
example, responses “If X then Y” and “If Y then X” may 
have completely different meanings depending on the context, 
where it is possible for one to exhibit a common misconception 
while the other one does not. Using the Word2Vec and 
GLOVE embeddings, these responses will be embedded to 
the same feature vector $f_{i,j}$, making them indistinguishable 
from each other. Therefore, our third word-vector embedding 
is based on the long short-term memory (LSTM) neural 
network, which is a recurrent neural network that excels at 
capturing long-term dependencies in sequential data. It can 
thus take word ordering into account, a feature that 
we believe is critical for misconception detection. We 
implement a 2-layer LSTM network with 10 hidden units and 
train it on the OpenStax Biology textbook. For each student 
Figure 1: Visualization of the statistical model. 
Black nodes denote observed data; white nodes denote 
latent variables to be inferred. 


response, we use the text as character-by-character inputs 
to the LSTM network and use the last layer’s hidden unit 
activation values (stacked in a $D = 10$ dimensional vector) 
as its textual feature $f_{i,j}$. 


3. STATISTICAL MODEL 


We now detail our statistical model; its graphical model 
is visualized in Figure 1. Concretely, let there be a total 
of $N$ students, $Q$ questions, and $K$ misconceptions. Let 
$M_{i,j} \in \{0,1\}$ denote the binary-valued misconception label 
on the response of student $j$ to question $i$ provided by an 
expert grader, with $j \in \{1,\ldots,N\}$ and $i \in \{1,\ldots,Q\}$, where 
1 represents the presence of (one or more) misconceptions, 
and 0 represents no misconceptions. 


We transform the raw text of student $j$’s response to 
question $i$ into a $D$-dimensional real-valued feature vector, 
denoted by $f_{i,j} \in \mathbb{R}^D$, via a pre-processing step (detailed in the 
previous section). Let $\Omega \subseteq \{1,\ldots,Q\} \times \{1,\ldots,N\}$ denote 
the subset of student responses that are labeled, since every 
student only responds to a subset of the questions. 


We denote the tendency of student $j$ to exhibit 
misconception $k$, with $k \in \{1,\ldots,K\}$, as $c_{k,j} \in \mathbb{R}$, and the confusion 
level of question $i$ on misconception $k$ as $d_{i,k} \in \mathbb{R}$. Then, 
let $P_{i,j,k} \in \{0,1\}$ denote the binary-valued latent variable 
that represents whether student $j$ exhibits misconception $k$ 
in their response to question $i$, with 1 denoting that the 
misconception is present and 0 otherwise. We model $P_{i,j,k}$ 
as a Bernoulli random variable 

$$p(P_{i,j,k} = 1) = \Phi(c_{k,j} + d_{i,k}), \quad (i,j) \in \Omega, \tag{1}$$

where $\Phi(x) = \int_{-\infty}^{x} \mathcal{N}(t; 0, 1)\, dt$ denotes the inverse probit 
link function (the cumulative distribution function of the 
standard normal random variable). Given $P_{i,j,k}\ \forall k$, we model 
the observed misconception label $M_{i,j}$ as 

$$M_{i,j} = \begin{cases} 0 & \text{if } P_{i,j,k} = 0\ \forall k, \\ 1 & \text{otherwise,} \end{cases} \quad (i,j) \in \Omega. \tag{2}$$

In words, a response is labeled as having a misconception if 
one or more misconceptions is present (given by the latent 
misconception exhibition variables $P_{i,j,k}$). Given $P_{i,j,k}\ \forall k$, 
the textual response feature vector that corresponds to 
student $j$’s response to question $i$, $f_{i,j}$, is modeled as 

$$f_{i,j} \sim \mathcal{N}\Big(\gamma_i + \sum_k P_{i,j,k}\,\theta_k,\ \Sigma_F\Big), \quad \forall (i,j) \in \Omega, \tag{3}$$

where $\gamma_i$ denotes the feature vector that corresponds to the 
correct response to question $i$, $\theta_k$ denotes the feature vector 
that corresponds to misconception $k$, and $\Sigma_F$ denotes the 
covariance matrix of the multivariate normal distribution 
characterizing the feature vectors. In other words, the feature 
vector of each response is a mixture of the feature vectors 
corresponding to the correct response to the question and 
each misconception the student exhibits. In the next section, 
we develop an MCMC inference algorithm to infer the values 
of the latent variables $\gamma_i$, $\theta_k$, $\Sigma_F$, $P_{i,j,k}$, $c_{k,j}$, and $d_{i,k}$, given 
observed data $f_{i,j}$ and $M_{i,j}$. 
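
The generative process in (1)-(3) can be summarized with a small simulation sketch. To stay dependency-free it assumes an isotropic covariance (a single standard deviation `sigma` in place of the full covariance matrix), which is a simplification of the model; all function and argument names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def probit(x: float) -> float:
    """Standard normal CDF, written Phi(x) in (1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sample_response(gamma_i: Sequence[float], thetas: Sequence[Sequence[float]],
                    c_j: Sequence[float], d_i: Sequence[float],
                    sigma: float, rng: random.Random) -> Tuple[List[int], int, List[float]]:
    """Draw (P, M, f) for one student-question pair following (1)-(3)."""
    # (1): each misconception indicator is Bernoulli with a probit success probability.
    P = [1 if rng.random() < probit(c + d) else 0 for c, d in zip(c_j, d_i)]
    # (2): the label is 1 as soon as any misconception is present.
    M = 1 if any(P) else 0
    # (3): the mean is the correct-response vector plus every exhibited misconception vector.
    mean = list(gamma_i)
    for p, theta in zip(P, thetas):
        if p:
            mean = [m + t for m, t in zip(mean, theta)]
    # Assumption: isotropic noise instead of a full covariance matrix.
    f = [rng.gauss(m, sigma) for m in mean]
    return P, M, f


rng = random.Random(0)
print(sample_response([0.0, 0.0], [[1.0, 1.0], [2.0, -2.0]], [0.5, -1.0], [0.2, 0.1], 0.1, rng))
```

A student with a large tendency $c_{k,j}$ on a question with a large confusion level $d_{i,k}$ exhibits misconception $k$ with probability close to 1, which shifts the response vector by $\theta_k$.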


4. PARAMETER INFERENCE 


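We use a Gibbs sampling algorithm [5] for parameter 
inference under the proposed statistical model. The prior 
distributions of the latent variables are listed as follows: 

$$\gamma_i \sim \mathcal{N}(\mu_\gamma, \Sigma_\gamma), \quad \theta_k \sim \mathcal{N}(\mu_\theta, \Sigma_\theta), \quad \Sigma_F \sim IW(h_F, V_F),$$
$$c_{k,j} \sim \mathcal{N}(\mu_c, \sigma_c^2), \quad d_{i,k} \sim \mathcal{N}(\mu_d, \sigma_d^2),$$

where $IW(\cdot)$ denotes the inverse-Wishart distribution and $\mu_\gamma$, 
$\Sigma_\gamma$, $\mu_\theta$, $\Sigma_\theta$, $h_F$, $V_F$, $\mu_c$, $\sigma_c^2$, $\mu_d$, and $\sigma_d^2$ are hyperparameters. 


We start by randomly initializing the values of the latent 
variables $\gamma_i$, $\theta_k$, $\Sigma_F$, $P_{i,j,k}$, $c_{k,j}$, and $d_{i,k}$ by sampling 
from their prior distributions. Then, in each iteration of our 
Gibbs sampling algorithm, we iteratively sample the value 
of each random variable from its full conditional posterior 
distribution. Specifically, in each iteration, we perform the 
following steps: 


a) Sample $P_{i,j,k}$: We first sample the latent misconception 
indicator variable $P_{i,j,k}$ from its posterior distribution 
as 

$$p(P_{i,j,k} = 1 \mid \cdot) = \begin{cases} 0 & \text{if } M_{i,j} = 0, \\ 1 & \text{if } M_{i,j} = 1 \text{ and } P_{i,j,k'} = 0\ \forall k' \neq k, \\ r & \text{if } M_{i,j} = 1 \text{ and } \exists k' \neq k \text{ s.t. } P_{i,j,k'} = 1, \end{cases}$$

where 

$$r = \frac{p(f_{i,j} \mid \gamma_i, \theta_k\ \forall k, \Sigma_F, P_{i,j,k'}\ \forall k' \neq k, P_{i,j,k} = 1)\; p(P_{i,j,k} = 1 \mid c_{k,j}, d_{i,k})}{\sum_{x \in \{0,1\}} p(f_{i,j} \mid \gamma_i, \theta_k\ \forall k, \Sigma_F, P_{i,j,k'}\ \forall k' \neq k, P_{i,j,k} = x)\; p(P_{i,j,k} = x \mid c_{k,j}, d_{i,k})}.$$

Terms in these expressions are given by (1) and (3). 


b) Sample $\gamma_i$: We then sample the feature vector that 
corresponds to the correct response to each question, $\gamma_i$, 
from its posterior distribution as $\gamma_i \sim \mathcal{N}(\mu_{\gamma_i}, \Sigma_{\gamma_i})$ 
where 

$$\mu_{\gamma_i} = \Sigma_{\gamma_i} \Big( \Sigma_\gamma^{-1} \mu_\gamma + \Sigma_F^{-1} \sum_{j:(i,j) \in \Omega} \big( f_{i,j} - \sum_k P_{i,j,k}\,\theta_k \big) \Big),$$
$$\Sigma_{\gamma_i} = \big( \Sigma_\gamma^{-1} + n_i \Sigma_F^{-1} \big)^{-1},$$

where $n_i = \sum_j \mathbb{1}((i,j) \in \Omega)$. 

c) Sample $\theta_k$: We then sample the feature vector that 
corresponds to each misconception, $\theta_k$, from its posterior 
distribution as $\theta_k \sim \mathcal{N}(\mu_{\theta_k}, \Sigma_{\theta_k})$ where 

$$\mu_{\theta_k} = \Sigma_{\theta_k} \Big( \Sigma_\theta^{-1} \mu_\theta + \Sigma_F^{-1} \sum_{i,j:P_{i,j,k}=1} \big( f_{i,j} - \gamma_i - \sum_{k' \neq k} P_{i,j,k'}\,\theta_{k'} \big) \Big),$$
$$\Sigma_{\theta_k} = \big( \Sigma_\theta^{-1} + n_k \Sigma_F^{-1} \big)^{-1},$$

where $n_k = \sum_{i,j} \mathbb{1}(P_{i,j,k} = 1)$. 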
Figure 2: Comparison of the prediction performance of the proposed model against RF on our AP Biology 
dataset using the ACC metric as the number of latent misconceptions K varies, with the LSTM embedding. 


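d) Sample $\Sigma_F$: We then sample the covariance matrix $\Sigma_F$ 
from its posterior distribution as 

$$\Sigma_F \sim IW(h_F + n,\ V_F + M),$$

where $n = \sum_{i,j} \mathbb{1}((i,j) \in \Omega)$ and 
$M = \sum_{i,j:(i,j) \in \Omega} \big( f_{i,j} - \gamma_i - \sum_k P_{i,j,k}\,\theta_k \big) \big( f_{i,j} - \gamma_i - \sum_k P_{i,j,k}\,\theta_k \big)^T$. 


e) Sample $c_{k,j}$ and $d_{i,k}$: In order to sample $c_{k,j}$ and $d_{i,k}$, we 
first sample the value of the auxiliary variable $z_{i,j,k}$ 
(following the standard approach proposed in [1]) as 

$$z_{i,j,k} \sim \mathcal{N}^{\pm}(c_{k,j} + d_{i,k},\ 1), \quad \forall (i,j) \in \Omega,$$

where $\mathcal{N}^{\pm}(\cdot)$ denotes the truncated normal 
distribution, truncated to the positive side when $P_{i,j,k} = 1$ 
and to the negative side when $P_{i,j,k} = 0$. We then sample 
$c_{k,j}$ from its posterior distribution as 

$$c_{k,j} \sim \mathcal{N}(\mu_{c_{k,j}},\ \sigma^2_{c_{k,j}}),$$

where $n_j = \sum_i \mathbb{1}((i,j) \in \Omega)$, $\sigma^2_{c_{k,j}} = 1/(1/\sigma_c^2 + n_j)$, and 
$\mu_{c_{k,j}} = \sigma^2_{c_{k,j}} \big( \mu_c/\sigma_c^2 + \sum_{i:(i,j) \in \Omega} (z_{i,j,k} - d_{i,k}) \big)$; we 
then sample $d_{i,k}$ from its posterior distribution as 

$$d_{i,k} \sim \mathcal{N}(\mu_{d_{i,k}},\ \sigma^2_{d_{i,k}}),$$

where $\sigma^2_{d_{i,k}} = 1/(1/\sigma_d^2 + n_i)$ and 
$\mu_{d_{i,k}} = \sigma^2_{d_{i,k}} \big( \mu_d/\sigma_d^2 + \sum_{j:(i,j) \in \Omega} (z_{i,j,k} - c_{k,j}) \big)$. 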
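
Step a) above amounts to Bayes' rule applied to the prior in (1) and the likelihood in (3). The following minimal sketch computes the conditional probability that one indicator equals 1; it assumes the two likelihood values have already been evaluated, and the function and argument names are hypothetical.

```python
def misconception_posterior(label: int, other_present: bool,
                            lik_present: float, lik_absent: float,
                            prior_present: float) -> float:
    """Conditional probability that one misconception indicator equals 1.

    label:          observed misconception label of the response (0 or 1)
    other_present:  True if any other misconception indicator of this response is 1
    lik_present:    likelihood of the feature vector under (3) with this indicator set to 1
    lik_absent:     likelihood of the feature vector under (3) with this indicator set to 0
    prior_present:  prior probability from (1), i.e. Phi(c + d)
    """
    if label == 0:
        return 0.0          # a response labeled 0 exhibits no misconception
    if not other_present:
        return 1.0          # label 1 requires at least one active indicator
    numerator = lik_present * prior_present
    denominator = numerator + lik_absent * (1.0 - prior_present)
    return numerator / denominator


print(misconception_posterior(1, True, lik_present=0.2, lik_absent=0.2, prior_present=0.3))
```

When the two likelihoods are equal, the feature vector carries no information about this indicator and the conditional probability falls back to the prior.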


We run the iterations detailed above for a total of $T$ 
iterations with a certain burn-in period, and use the 
samples of each latent variable to approximate their posterior 
distributions. 
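
The data-augmentation update in step e) can be sketched as follows, using the inverse-CDF method for the truncated normal draw and the standard conjugate normal update for a student's tendency. This is an illustrative sketch rather than the implementation used here; the function names are hypothetical.

```python
import random
from statistics import NormalDist
from typing import Sequence, Tuple

_STD = NormalDist()  # standard normal distribution


def sample_truncated_z(mean: float, present: bool, rng: random.Random) -> float:
    """Draw z ~ N(mean, 1) restricted to z > 0 if present, else to z < 0."""
    cut = _STD.cdf(-mean)                      # probability that z <= 0
    lo, hi = (cut, 1.0) if present else (0.0, cut)
    u = lo + (hi - lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)        # keep inv_cdf inside its open domain
    return mean + _STD.inv_cdf(u)


def tendency_posterior(z_values: Sequence[float], d_values: Sequence[float],
                       mu_c: float, var_c: float) -> Tuple[float, float]:
    """Mean and variance of the normal conditional of a student's tendency,
    given auxiliary variables z and question confusion levels d."""
    n_j = len(z_values)
    var_post = 1.0 / (1.0 / var_c + n_j)
    mean_post = var_post * (mu_c / var_c + sum(z - d for z, d in zip(z_values, d_values)))
    return mean_post, var_post


rng = random.Random(1)
z = [sample_truncated_z(0.3, True, rng) for _ in range(3)]
print(z, tendency_posterior(z, [0.1, 0.1, 0.1], mu_c=0.0, var_c=1.0))
```

The update for a question's confusion level has the same form with the roles of the student and question terms exchanged.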


Parameter inference under our model suffers from the 
label-switching issue that is common in mixture models [5], 
meaning that the mixture components might be permuted between 
iterations. We employ a post-processing step to resolve this 
issue. We first calculate the augmented data likelihood at 
each iteration (indexed by $\ell$); we then identify the iteration 
$\ell_{\max}$ with the largest augmented data likelihood, and 
permute the variables $\theta_k^{\ell}$, $c_{k,j}^{\ell}$, and $d_{i,k}^{\ell}$ so that they best match the 
variables $\theta_k^{\ell_{\max}}$, $c_{k,j}^{\ell_{\max}}$, and $d_{i,k}^{\ell_{\max}}$. After this post-processing 
step, we can simply calculate the posterior means of each one 
of these sets of variables by taking averages of their values 
across non-burn-in iterations. 
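
The relabeling step can be illustrated with a brute-force sketch. The matching criterion is not specified above, so the sketch assumes the smallest total squared Euclidean distance between misconception vectors; the function names are hypothetical, and an assignment solver would be preferable to exhaustive search when K is large.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def best_permutation(theta_ref: Sequence[Sequence[float]],
                     theta_iter: Sequence[Sequence[float]]) -> Tuple[int, ...]:
    """Return perm such that theta_iter[perm[k]] is matched to theta_ref[k]."""
    K = len(theta_ref)

    def cost(perm: Tuple[int, ...]) -> float:
        # Assumption: total squared Euclidean distance as the matching criterion.
        return sum((a - b) ** 2
                   for k in range(K)
                   for a, b in zip(theta_ref[k], theta_iter[perm[k]]))

    return min(permutations(range(K)), key=cost)


def relabel(values: Sequence, perm: Sequence[int]) -> List:
    """Reorder per-misconception values of one iteration with the chosen permutation."""
    return [values[p] for p in perm]


reference = [[0.0, 0.0], [5.0, 5.0]]        # misconception vectors at the reference iteration
switched = [[5.1, 4.9], [0.1, -0.1]]        # same components, labels swapped
perm = best_permutation(reference, switched)
print(perm, relabel(switched, perm))
```

The same permutation would be applied to the student tendencies and question confusion levels of that iteration before averaging.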


5. EXPERIMENTS 


We experimentally validate the efficacy of the proposed 
framework using our AP Biology class dataset. We first compare 
the proposed framework against a baseline random forest 
(RF) classifier that classifies whether a student response 
exhibits one or more misconceptions. We then show common 
misconceptions detected in our datasets and discuss how 
the proposed framework can use this information to deliver 
meaningful targeted feedback to students that helps them 
correct their misconceptions. 


5.1 Experimental setup 

We run our experiments with $K \in \{2,4,6,8,10\}$ latent 
misconceptions with hyperparameters $\mu_\gamma = \mu_\theta = 0_D$, 
$\Sigma_\gamma = \Sigma_\theta = V_F = I_D$, $h_F = 10$, $\mu_c = \mu_d = 0$, and $\sigma_c^2 = \sigma_d^2 = 1$, 
for a total of $T = 500$ iterations with the first 250 iterations 
as burn-in. We compare the proposed framework against 
a baseline random forest (RF) classifier¹ using the textual 
response feature vectors $f_{i,j}$ to classify the binary-valued 
misconception label $M_{i,j}$, with 200 decision trees. 


We randomly partition each dataset into 5 folds and use 4 
folds as the training set and the other fold as the test set. We 
then train the proposed framework and RF on the training 
set and evaluate their performance on the test set, using 
two metrics: i) prediction accuracy (ACC), i.e., the portion 
of correct predictions, and ii) area under curve (AUC), i.e., 
the area under the receiver operating characteristic (ROC) 
curve of the resulting binary classifier [7]. Both metrics take 
values in [0,1], with larger values corresponding to better 
prediction performance. We repeat our experiments for 20 
random partitions of the folds. 
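
Both metrics can be computed directly from the held-out labels and predictive probabilities. The sketch below is a minimal illustration; the 0.5 decision threshold for ACC is an assumption, and AUC is computed through its pairwise-ranking interpretation rather than by tracing the ROC curve.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], probs: Sequence[float], threshold: float = 0.5) -> float:
    """Portion of correct predictions after thresholding the predictive probabilities."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(y == y_hat) for y, y_hat in zip(labels, preds)) / len(labels)


def auc(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative
    (ties count one half); equal to the area under the ROC curve."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
print(accuracy(labels, probs), auc(labels, probs))
```

Unlike ACC, AUC does not depend on a threshold, which makes it more informative when most responses carry the same label.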


For the proposed framework, the predictive probability that 
a response with its feature vector $f_{i,j}$ exhibits a 
misconception, i.e., the probability that at least one of the $K$ latent 
misconception exhibition state variables takes the value of 1, 
is given by $1 - p_{i,j}$, where 

$$p_{i,j} = p(M_{i,j} = 0 \mid f_{i,j}, \gamma_i, \Sigma_F, \theta_k, \forall k, c_{k,j}, d_{i,k}) = \frac{p(f_{i,j} \mid \theta_k, P_{i,j,k} = 0, \forall k) \prod_k p(P_{i,j,k} = 0 \mid c_{k,j}, d_{i,k})}{\sum_{P_{i,j,k}, \forall k} p(f_{i,j} \mid \theta_k, P_{i,j,k}, \forall k) \prod_k p(P_{i,j,k} \mid c_{k,j}, d_{i,k})},$$

where in the last expression we omitted the conditional 
dependency of $f_{i,j}$ on $\gamma_i$ and $\Sigma_F$ due to spatial constraints. 
For RF, the predictive probability is given by the fraction of 
decision trees that classify $M_{i,j} = 1$ given $f_{i,j}$. 
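
For small K, the predictive probability of the proposed framework can be evaluated by enumerating all 2^K configurations of the misconception indicators. The sketch below assumes an isotropic Gaussian likelihood (a single standard deviation `sigma`) in place of the full covariance matrix, and all names are hypothetical.

```python
import math
from itertools import product
from typing import Sequence


def probit(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gaussian_density(f: Sequence[float], mean: Sequence[float], sigma: float) -> float:
    """Density of an isotropic Gaussian (assumption: covariance = sigma^2 * I)."""
    norm = (sigma * math.sqrt(2.0 * math.pi)) ** len(f)
    sq = sum((x - m) ** 2 for x, m in zip(f, mean))
    return math.exp(-sq / (2.0 * sigma ** 2)) / norm


def misconception_probability(f, gamma_i, thetas, c_j, d_i, sigma: float) -> float:
    """Posterior probability that at least one misconception is present."""
    prior_on = [probit(c + d) for c, d in zip(c_j, d_i)]
    weight_none, weight_total = 0.0, 0.0
    for state in product((0, 1), repeat=len(thetas)):
        mean, prior = list(gamma_i), 1.0
        for k, p in enumerate(state):
            prior *= prior_on[k] if p else 1.0 - prior_on[k]
            if p:
                mean = [m + t for m, t in zip(mean, thetas[k])]
        weight = gaussian_density(f, mean, sigma) * prior
        weight_total += weight
        if not any(state):
            weight_none = weight   # configuration with no misconception
    return 1.0 - weight_none / weight_total


print(misconception_probability([3.0, 3.0], [0.0, 0.0], [[3.0, 3.0]], [0.0], [0.0], 1.0))
```

With K at most 10, as in the experimental setup, the enumeration visits at most 1,024 configurations per response.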


5.2. Results and discussions 
The number of latent misconceptions K is an important pa- 
rameter controlling the granularity of the misconceptions that 


¹The RF classifier achieves the best performance among 
a number of off-the-shelf baseline classifiers, e.g., logistic 
regression, support vector machines, etc. Therefore, we do 
not compare it against other baseline classifiers. 
Embedding | Method             | Unit 1 ACC  | Unit 1 AUC  | Unit 2 ACC  | Unit 2 AUC  | Unit 3 ACC  | Unit 3 AUC  | Unit 4 ACC  | Unit 4 AUC 
Word2Vec  | Proposed framework | 0.789±0.014 | 0.762±0.027 | 0.774±0.015 | 0.758±0.023 | 0.779±0.019 | 0.752±0.020 | 0.887±0.011 | 0.774±0.029 
Word2Vec  | RF                 | 0.762±0.019 | 0.645±0.025 | 0.735±0.011 | 0.676±0.014 | 0.758±0.017 | 0.630±0.024 | 0.873±0.009 | 0.604±0.034 
GLOVE     | Proposed framework | 0.867±0.014 | 0.762±0.048 | 0.870±0.010 | 0.821±0.024 | 0.893±0.017 | 0.794±0.039 | 0.953±0.015 | 0.892±0.047 
GLOVE     | RF                 | 0.876±0.014 | 0.697±0.022 | 0.859±0.013 | 0.771±0.040 | 0.883±0.008 | 0.616±0.043 | 0.948±0.019 | 0.731±0.006 
LSTM      | Proposed framework | 0.873±0.042 | 0.772±0.093 | 0.865±0.025 | 0.829±0.044 | 0.873±0.027 | 0.792±0.061 | 0.936±0.032 | 0.832±0.094 
LSTM      | RF                 | 0.865±0.035 | 0.711±0.086 | 0.838±0.028 | 0.722±0.043 | 0.854±0.028 | 0.697±0.057 | 0.931±0.025 | 0.709±0.105 


Table 2: Performance comparison on misconception label classification of a textual response in terms of the 
prediction accuracy (ACC) and area under the receiver operating characteristic curve (AUC) of the proposed 
framework against a random forest (RF) classifier, using the AP Biology dataset and the Word2Vec (top), 
GLOVE (middle), and LSTM (bottom) embeddings. 


we aim to detect. Figure 2 shows the comparison between 
the proposed framework using different values of K and RF 
using the ACC metric with the LSTM embedding. We see 
an obvious trend that, as K increases, the prediction perfor- 
mance decreases. The likely cause of this trend is that the 
proposed framework tends to overfit as the number of latent 
misconceptions grows very large since some of our datasets 
do not contain very rich misconception types. Moreover, the 
number of common misconceptions varies across different 
units, with Unit 2 likely containing more misconception types 
than Units 1 and 4. 


We then compare the performance of the proposed framework 
against RF on misconception label classification in Table 2 
using K = 2 and all three embeddings. The proposed frame- 
work significantly outperforms RF (1-4% using the ACC 
metric and 4-18% using the AUC metric) on almost all 4 
data subsets using every embedding. The only case where 
the proposed framework does not outperform RF is on Unit 1 
using the GLOVE embedding. We postulate that the reason 
for this result is that this unit is about chemistry and has a 
lot of responses with more chemical molecular expressions 
than words; therefore, the proposed framework does not 
have enough textual information to exhibit its advantages 
(grouping responses that share the same misconceptions into 
clusters) over the RF classifier. 


Both the proposed framework and RF perform much better 
using the GLOVE and LSTM embeddings than the Word2Vec 
embedding. This result is likely due to the fact that these 
embeddings are more advanced than the Word2Vec embedding: 
the GLOVE embedding considers word co-occurrence statistics 
that the Word2Vec embedding does not, is 
trained on a much larger corpus, and has a higher dimension 
D = 25, while the LSTM embedding is the only embedding 
that takes word ordering into account. Moreover, both 
algorithms perform best on Unit 4, which is likely due to 
two reasons: i) the Unit 4 subset has a larger portion of its 
responses labeled, and ii) Unit 4 is about evolution, which 
results in responses that are much longer and thus contain 
richer textual information. 


5.3. Uncovering common misconceptions 

We emphasize that, in addition to the proposed framework’s 
significant improvement over RF in terms of misconception 
label classification, it features great interpretability since 
it identifies common misconceptions from data. As an il- 
lustrative example, the following responses from multiple 
students across two questions are identified to exhibit the 
same misconception in the Unit 4 subset using the Word2Vec
embedding:


Question 1: People who breed domesticated animals 
try to avoid inbreeding even though most domesticated 
animals are indiscriminate. Evaluate why this is a good 
practice. 

Correct Response: A breeder would not allow close rel- 
atives to mate, because inbreeding can bring together 
deleterious recessive mutations that can cause abnor- 
malities and susceptibility to disease. 

Student Response 1: Inbreeding can cause a rise in 
unfavorable or detrimental traits such as genes that 
cause individuals to be prone to disease or have unfa- 
vorable mutations. 

Student Response 2: Interbreeding can lead to harm- 
ful mutations. 


Question 2: When closely related individuals mate with 
each other, or inbreed, the offspring are often not as fit 
as the offspring of two unrelated individuals. Why? 
Correct Response: Inbreeding can bring together rare, 
deleterious mutations that lead to harmful phenotypes. 
Student Response 3: Leads to more homozygous 
recessive genes thus leading to mutation or disease. 
Student Response 4: When related individuals mate 
it can lead to harmful mutations. 


Although these responses are from different students to dif- 
ferent questions, they exhibit one common misconception, 
that inbreeding leads to harmful mutations. Once this mis- 
conception is identified, course instructors can deliver the 
targeted feedback that inbreeding only brings together harm- 
ful mutations, leading to issues like abnormalities, rather 
than directly leading to harmful mutations. 


Moreover, the proposed framework can automatically dis- 
cover common misconceptions that students exhibit without 
input from domain experts, especially when the number of 
students and questions are very large. Specifically, in the 
example above, we are able to detect such a common mis- 
conception that 4 responses exhibit by analyzing the 1016 
responses in the AP Biology Unit 4 dataset; however, it 
would not likely be detected if the number of responses was 
smaller and fewer students exhibited the misconception. This
feature makes it an attractive data-driven aid to domain ex- 
perts in designing educational content to address student 
misconceptions. 


We show another example in which the proposed framework
automatically groups student responses into the same group
according to the misconceptions they exhibit. The example
shows two detected common misconceptions among students’ 
responses to a single question in the Unit 2 subset using the 
LSTM embedding: 


Question: What is the primary energy source for cells? 
Correct response: Glucose. 

Student responses with misconception 1: 

a) sunlight b) sum c) The sun d) he sun? 


Student responses with misconception 2: 
a) ATP b) adenosine triphosphate 
c) ATPPPPPPPPPPPPP d) atp mitochondria 


We see that the proposed framework has successfully iden- 
tified two common misconception groups, with incorrect 
responses that list “sun” and “ATP” as the primary energy 
source for cells. Note that the LSTM embedding enables 
the framework to assign the full and abbreviated form of 
the same entity (“adenosine triphosphate” and “ATP”) into 
the same misconception cluster, without employing any
preprocessing on the raw textual response data. The likely
reason for this result is that our LSTM embedding is trained 
on a character-by-character level on the OpenStax Biology 
textbook, where these terms appear together frequently, thus 
enabling the LSTM to transform them into similar vectors. 
This observation highlights the importance of using good, 
information-preserving word-vector embeddings for the pro- 
posed framework to maximize its capability of detecting 
common misconceptions. 


6. CONCLUSIONS AND FUTURE WORK 


In this paper, we have proposed a natural language processing- 
based framework for detecting and classifying common mis- 
conceptions in students’ textual responses. Our proposed 
framework first transforms their textual responses into low- 
dimensional feature vectors using three existing word-vector 
embedding techniques, and then estimates the feature vec- 
tors characterizing each misconception, among other latent 
variables, using a proposed mixture model that leverages 
information provided by expert human graders. Our ex- 
periments on a real-world educational dataset consisting of 
students’ textual responses to short-answer questions showed 
that the proposed framework excels at classifying whether 
a response exhibits one or more misconceptions. Our pro- 
posed framework is also able to group responses with the 
same misconceptions into clusters, enabling the data-driven 
discovery of common misconceptions without input from 
domain experts. Possible avenues of future work include i)
automatically generating the appropriate feedback to correct
each misconception, ii) leveraging additional information, such
as the text of the correct response to each question, to further
improve the performance on predicting misconception labels,
iii) exploring the relationship between the dimension of the
word-vector embeddings and prediction performance, and
iv) developing embeddings for other types of responses, e.g.,
mathematical expressions and chemical equations.

ABSTRACT


Scientific explanations, which include a claim, evidence, and 
reasoning (CER), are frequently used to measure students’ deep 
conceptual understandings of science. In this study, we developed 
an automated scoring approach for the CER that students 
constructed as a part of virtual inquiry (e.g., formulating questions, 
analyzing data, and warranting claims) in an intelligent tutoring 
system (ITS), called Inq-ITS. Results showed that the automated
scoring of CER was strongly correlated with human scores when 
validated using independent sets of data from both the same inquiry 
task/question, as well as when using data from a different inquiry 
task/question. These findings imply that automated CER is a very 
promising approach to reliably and efficiently score scientific 
explanations in open response format for both small- and
large-scale assessments. It also provides Inq-ITS with the capability to
assess the full complement of inquiry practices described by NGSS. 


Keywords 


automated assessment, scientific explanation, claim, evidence, 
reasoning 


1. INTRODUCTION 


The implementation of the Next Generation Science Standards 
(NGSS) has led to a need for assessments that are able to capture 
students’ competencies at science inquiry practices [21]. Open- 
response tasks have been used in assessments for science inquiry 
because they can elicit students’ communication skills, conceptual 
understandings, and ability to reason from evidence due to the 
measurement constraints of traditional multiple-choice items [10]. 
Rubrics for scoring students’ explanations have been developed 
according to frameworks, such as Toulmin’s [27] model of 
argumentation [8,16]. A modified version of Toulmin’s model 
consists of three components: claim (an assertion about an 
investigated question), evidence (data or observations that support 
the assertion, i.e., the claim), and reasoning (articulating how the 
evidence supports the claim and how scientific principles explain 
the relationship between the data and claim). 


Previous studies have developed rubrics to assess the accuracy of 
claim, evidence, and reasoning (CER) in students’ scientific 
explanations. Gotwals and Songer [8] applied a rubric following 
the CER framework in order to score middle school students’
explanations in an ecological science assessment. The rubric
scoring for each component of CER was on a scale from 0 to 2
according to the accuracy and depth of students’ responses. McNeill
et al. [16] scored students’ responses to explanation prompts for 
middle school chemistry with a rubric that also followed the CER 
format using a 0 to 2 scale. These general rubrics for open response 
items provide some insight regarding the argumentation skill level 
of students, which can be valuable for guiding teachers’ instruction 
and feedback. Open response items, however, can be time 
consuming and costly to score [28]; they can be inaccurately scored 
due to human factors such as rater fatigue [19], and rubrics can be 
interpreted and used differently by different raters [1]. One way to 
resolve these issues is through the use of automated scoring 
techniques [30]. 


Automated scoring techniques also permit automated feedback to 
students as they write scientific explanations or immediately 
following their writing tasks, when students have the opportunity 
to revise their writing. Automated, real-time feedback has been 
found to: significantly reduce the time between response 
submission and feedback relative to human scorers [15] and be as, 
if not more, effective than feedback presented by teachers [3]. 
While automated scoring presents an efficient and accurate means 
for promoting student learning gains, no studies, to date, have 
developed techniques for online, automated scoring of scientific 
explanations according to CER. 


The current paper presents a new automated scoring approach to 
CER using the techniques of both natural language processing and 
machine learning. The approach addresses accuracy as well as 
important structural components of explanations as identified in the 
CER framework. The approach was validated using correlations 
between human scores and automated scores for scientific 
explanations produced in the Inq-ITS learning environment. 
Automated scoring of CER will: dramatically reduce time and 
expense, improve the efficiency and accuracy of CER scores, allow 
for instantaneous feedback, and make individualized instruction 
from teachers and/or automated scaffolding possible. Furthermore, 
scoring these data is critical because our data show that many 
students who have acquired a deep understanding of science 
content and inquiry practices, cannot articulate in words what they 
have learned. Conversely, some students are able to simply parrot 
what they have heard/read when doing written CER tasks, but do 
not actually understand the science content or practices [4, 11]. 


1.1 Automated Open Response 

Automated scoring techniques have been developed to assess 
students’ open responses in computer-assisted assessments and 
learning environments for science. Techniques include natural 
language-processing (NLP), such as regular expressions [12], to 
determine whether students’ scientific explanations include key 
conceptual phrases [3, 13, 14]. The specific techniques and rubrics
used for automated scoring of science open response items vary
across programs as described below. 


The Summarization Integrated Development Environment (SIDE) 
uses a combination of NLP techniques and machine learning 
algorithms to score scientific explanations for the inclusion of 
biology concept knowledge [9, 20]. This system yielded 
correlations between human-scored and computer-scored
responses ranging from 0.79 to 0.87 depending on the sample of 
participants. Disagreement was attributed to differences in 
linguistic tendencies across samples [9]. A later study by Nehm et 
al. [20] on the same system found that agreement between human 
and computer scoring was strong (i.e., κ > .81). The SIDE program
may be a valuable tool in scoring student scientific explanations 
[20], but is limited to identifying the presence of concepts within 
responses, and as such is not useful at scoring students’ 
competencies at generating claims, evidence, and reasoning, which 
are critical to NGSS inquiry practices. 


Another program that has been used to autoscore scientific 
explanations is the SPSS Text Analysis (SPSSTA) program [29], 
which uses language-processing procedures to identify terms and 
note patterns within texts [25]. A study by Weston et al. [29] 
applied SPSSTA to score undergraduate responses to biology 
explanation prompts. The agreement between human-coded 
responses and the SPSSTA for different levels of an analytic rubric 
ranged from a kappa of 0.67 to 0.88. The SPSSTA program relates 
to SIDE in terms of its potential to identify important concepts, but 
is unable to automatically produce machine learning algorithms 
from a trained data set [9], and thus is limited in utility.


EvoGrader automatically scores constructed explanations using 
machine-learning algorithms [17]. A study compared EvoGrader 
scores to human scores based on the identification of nine key 
evolution concepts and strong agreement was found, as indicated 
by kappas above 0.85 for all concepts except one (κ = 0.71) [17].
The EvoGrader automated assessment system was able to produce 
human-like scoring of key evolutionary concepts, but would need 
retraining in order to be generalized to other domains. 


The c-Rater program scores scientific explanations based on the 
presence of central concepts using natural language processing 
[13]. A study by Liu et al. [13] compared human and c-Rater scores 
for four energy open response questions and found moderate 
agreement with Pearson correlations ranging from 0.67 to 0.72. 
While c-Rater was able to capture the presence of concepts, the 
program did not perform highly enough to be recommended for use 
as a sole scorer. Liu et al. [14] examined the agreement between 
human scorers and c-rater-ML, which is an autoscoring program 
that uses support vector regressions, a machine learning technique. 
Kappas across eight science explanation items ranged from 0.62 to 
0.90, indicating good to very good agreement between human 
raters and c-rater-ML on a 5-point rubric for connecting key ideas 
[14]. The high agreement on certain explanation items 
demonstrated the potential for c-rater-ML to be used as a sole 
scorer, but, as noted by the authors, sensitivity to variations in 
phrasing of central concepts needed to be improved. 


Automated scoring programs for scientific explanations exemplify 
the potential for accurate and efficient scoring of open responses in 
terms of the presence of scientific concepts, but do not provide 
opportunity for scoring more fine-grained components of
explanations. That is, auto-scoring techniques have yet to address
argumentative components of explanations that are central to
science inquiry, namely students’ competencies at generating
claims, evidence for claims, and articulating the link between the
two using reasoning, which are required by NGSS. Auto-scoring
specific sub-components of responses, as we have done in our
work, enables automated scaffolds that can, in turn, target specific 
areas of student difficulty. The rubrics for CER in previous studies 
broadly categorized responses into incorrect, partially correct, or 
fully correct, but failed to break down CER into finer-grained sub- 
skills or sub-components. As a result, previous rubrics have been 
unable to pinpoint exactly why students are having difficulties 
constructing explanations. In the present study, we developed a 
fine-grained rubric modified from McNeill et al. [16]. 


1.2 Description of Inq-ITS

Inq-ITS is a web-based intelligent tutoring system for Physical,
Life, and Earth science that automatically assesses scientific 
inquiry practices at the middle school level in real time within 
interactive microworld simulations [5, 24]. Within each 
microworld, inquiry practices proposed in the NGSS for middle 
school are assessed including: question asking/hypothesizing, 
collecting data, analyzing data, warranting claims, and 
communicating findings using a CER framework. 


Automated scoring has been implemented within Inq-ITS with
patented algorithms [5] to measure sub-skills of each inquiry 
practice based on actions recorded in log files [7, 23]. Automated 
scoring of sub-skills in Inq-ITS required building detectors based
on data-mined algorithms that captured variations of complex 
behaviors, such as designing controlled experiments [7, 24]. In 
order to build detectors, human raters used text-replay tagging to 
identify key behavioral features and train models that determined 
the presence of particular sub-skills [23]. The additional 
implementation of Bayesian Knowledge Tracing and Knowledge 
Engineering has enabled real-time, automated feedback that 
scaffolds students as they engage in inquiry practices in Inq-ITS [7]
and has been found to result in significant inquiry learning gains 
for students [18, 22]. Sao Pedro and his colleagues [22, 24] found 
that students who had no experience with designing controlled 
experiments and testing stated hypotheses were able to acquire 
these skills after receiving scaffolded feedback from Inq-ITS’s 
pedagogical agent, Rex. Moussavi, Gobert, and Sao Pedro [18] 
found that students who received scaffolds on data interpretation 
skills in one science topic of Inq-ITS were better able to apply those
skills in a new science topic. 


While automated scoring and feedback have been successfully
applied to student actions in Inq-ITS, automated scoring has yet to
be developed for written explanations. The automated scoring 
approach presented in this paper allows for automatic scoring of 
students’ written scientific explanations in Inq-ITS, as well as lays 
the groundwork for the development of specific, automated 
feedback for open response items. 


2. METHOD 


2.1 Participants and Materials 

Participants were 293 middle school students from 18 classes in six 
public middle schools who completed the Inq-ITS density virtual
lab. The Density Virtual Lab contained three activities aimed to 
foster understanding about density of a liquid when using: different 
shapes of a container (narrow, square, and wide), different types of 
liquid (water, oil, and alcohol), and different amounts of liquid 
(quarter, half, and full). This study validated the automated scoring 
for the scientific explanations that students constructed in the first 
two activities: shape-density (N = 293) and type-density (N = 268) 
after a series of scientific investigations. The type-density data set 
was used to train and test the model with the method of 10-fold 
cross-validation. The shape-density data set was used to further test
the model to examine how well the model performed when it was
generalized to an independent data set. 


2.2 Rubrics and Inter-Rater Reliabilities 
Scientific explanations in Inq-ITS consisted of three components:
claim, evidence, and reasoning (CER). As previously stated, other 
rubrics have been unable to pinpoint exactly why students are 
having difficulties when constructing explanations. In the present 
study, we developed a fine-grained rubric modified from McNeill 
et al. [16], described as follows. 


Claim was graded by four sub-skills: independent variable (IV), IV 
relationship (IVR; the conditions that students changed in the 
controlled target IV), dependent variable (DV), and DV 
relationship (DVR; the effect of IV on DV). For example, a good 
claim that a student wrote in the type-density activity was: I found 
out when you change the type (IV) of the liquid from water to oil 
(IVR), the density (DV) will decrease (DVR). IV and DV were 
graded with binary scores: 1 for presence of the sub-skill and 0 for 
the absence. IVR was classified into four levels: (1) correct answers 
in which students reported two controlled conditions of the target 
IV, (2) general answers in which students stated IVR using general 
expressions rather than specifically stating the conditions of change 
(e.g., I found that the change (IVR) of type of liquid (IV) changes
(DVR) the density (DV)), (3) partial answers in which students only
reported one controlled target condition (e.g., The density of water
is the largest), and (4) incorrect answers. Therefore, correct IVR
was given 1 point; general IVR, 0.8 points; partially correct IVR,
0.5 points; and incorrect IVR, 0 points. DVR in the type-density
activity was scored according to three levels: correct (1 point),
general (0.8 points, as shown in the IVR example), and incorrect
(0 points). DVR in the shape-density activity was scored
dichotomously, correct (1 point) versus incorrect (0 points). The
IV (shape of the container) did not affect the DV (density), so
responses were either correct or incorrect and no general
expressions were involved.
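The claim rubric above can be sketched as a small lookup from rubric level to points. This is a minimal illustration, assuming that the total claim score is the sum of the four sub-skill scores (consistent with the 0 to 4 range given for claims) and that an incorrect DVR earns 0 points in both activities; the function and dictionary names are hypothetical.

```python
# Points per rubric level, as stated in the rubric description.
IVR_POINTS = {"correct": 1.0, "general": 0.8, "partial": 0.5, "incorrect": 0.0}
DVR_POINTS = {"correct": 1.0, "general": 0.8, "incorrect": 0.0}


def claim_score(iv_present, ivr_level, dv_present, dvr_level):
    """Total claim score on a 0-4 scale (assumed to be IV + IVR + DV + DVR)."""
    return (
        (1.0 if iv_present else 0.0)    # IV: binary presence
        + IVR_POINTS[ivr_level]         # IVR: four levels
        + (1.0 if dv_present else 0.0)  # DV: binary presence
        + DVR_POINTS[dvr_level]         # DVR: up to three levels
    )


# "When you change the type of the liquid from water to oil,
#  the density will decrease": all four sub-skills at the top level.
full_claim = claim_score(True, "correct", True, "correct")

# "The change of type of liquid changes the density": general IVR and DVR.
general_claim = claim_score(True, "general", True, "general")
```

Under these assumptions the first claim receives 4 points and the second 3.6 points.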


Evidence was scored by two sub-skills: sufficiency and
appropriateness [6]. Sufficiency was a measure of whether students 
provided sufficient evidence. If two controlled target conditions 
were stated, then 2 points were given. Mentioning only one 
controlled target condition was insufficient and was given 1 point. 
Using general expressions was given 0.5. Not mentioning any 
controlled target condition was incorrect and was given 0 points. 
Appropriateness was a measure of whether students provided 
appropriate data, such as the data of mass, volume, and density, as 
displayed in students’ data tables in Inq-ITS. This sub-skill was
consistent with the sufficiency of evidence, but focused on the data. 
Here is an example of a good answer in the shape-density activity: 
No matter what the container shapes are, narrow or wide, and the 
mass of oil was 212.5 (data of mass) while the volume was 250 (data 
of volume). The density resulted in 0.85 g/ml (data of density). If 
students specified the data of density, they were given 1 point for 
DVR in appropriate evidence; otherwise, 0 points. If they reported 
both the data of mass and volume, they were given 1 point. If they 
only reported the data of either mass or volume, they were given 
0.5 points. If they did not report any data of mass or volume, they 
were given 0 points. 


Reasoning was measured by three sub-skills: theory, connection 
between data and the claim, and data that supports or refutes the 
claim. Theory referred to whether students stated a scientific 
principle related to density, here being: the properties of a substance 
(based on the type of liquid) affect the density, not the shape of the 
container. Four categories were classified: (1) complete theory for 
2 points (e.g., When looking at the data chart, it is noticeable that 
the mass and volume don't change so the density doesn't change.),
(2) partial but closer to complete for 1 point (e.g., only mentioning
two of three properties), (3) partial but closer to none for 0.5 points 
(e.g., only mentioning one property), and (4) incorrect or no 
theories for 0 points (e.g., no property was mentioned). Connection 
between data and claim referred to whether students specified that 
their data supports or refutes their claim. If they did, 1 point was 
given (e.g., My evidence supports my claim...). If they only partially 
stated the connection, 0.5 points were given (e.g., It will support my 
claim...) because the student did not specify whether the data or 
evidence supported the claim. If there were no expressions 
specified, 0 points were given. Data in the reasoning task were 
similar to the claim task with one main difference. In scoring 
reasoning data, mentioning either IV or IVR was accepted as 
correct (1 point) and mentioning only one condition of change was
considered partially correct (0.5 points). 


Two expert raters scored students’ CER according to the
fine-grained rubric. The inter-rater reliabilities by Cronbach’s α were
.993, .994, .938 and the intraclass correlations were .986, .988, .882
for claim, evidence, and reasoning, respectively, higher than human 
agreement in prior studies (e.g., [14]). Disagreements were 
discussed until agreement was reached and agreed upon scores 
were used for analyses. 
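For reference, Cronbach’s α for k raters can be computed from the variance of each rater’s scores and the variance of the summed scores. The sketch below uses the standard formula with sample variances; the two score lists are hypothetical and are not the study’s data.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of per-rater score lists of equal length.

    alpha = k / (k - 1) * (1 - sum of per-rater variances / variance of totals)
    """
    k = len(ratings)
    rater_variances = sum(variance(scores) for scores in ratings)
    totals = [sum(per_response) for per_response in zip(*ratings)]
    return (k / (k - 1)) * (1 - rater_variances / variance(totals))


# Hypothetical scores from two raters on five responses.
rater_a = [0, 1, 2, 3, 4]
rater_b = [0, 1, 2, 4, 4]
alpha = cronbach_alpha([rater_a, rater_b])
```

Two raters who agree on every response yield α = 1; the single disagreement in the toy data lowers α slightly below 1.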


2.3 Automated Scoring 

The target sub-skills were extracted using regular expressions 
(RegEX) based on the rubrics used by human raters in section 2.2. 
RegEX is a natural language processing technique that often 
applies algorithms to search for specific phrases or phrases that are 
semantically equivalent to a target concept [26]. In ITSs, RegEX 
has been used to accurately identify the presence of target concepts 
in students’ responses [12]. Table 1 displays some examples of the 
RegEX that we used to extract features. RegEXs were generated 
based on semantically similar phrases that corresponded to a 
particular concept noted in the rubric. 


Table 1. Examples of RegEX in the Shape-Density Activity.

Component        Sub-skill     RegEX
Claim (0~4)      IV            shape
                 IVR           (narrow.*square)|(square.*wide)
                 DV            density
                 DVR           (^((?!n[o']t|doesn(')?t).)*(same|constant))
Evidence (0~4)   Sufficient    Same as IVR
                 Appropriate   ((mass.*volume).*250)|((volume*mass).*250)
                               …
Reasoning (0~6)  Theory        ((mass.*volume).*density)
                 Connection    (data|evidence).*(support|prove|indicate|…|show|refute).*…
                 Data IV/IVR   shape|(narrow.*(wide|square))
                 DV            Same as DV claim
                 DVR           Same as DVR claim


If the sub-skill was binary, RegEX was used to detect the presence
or absence of the content with Python programming language. If
the sub-skill contained more than two levels, RegEX was used to
detect the presence or absence of the sub-skill with a higher score
first, and then with a lower score. Each sub-skill at each level was
assigned to a binary score, 1 for the presence and 0 for absence of
the sub-skill. If the sub-skill had more than two scales, each scale
was assigned to a binary score first and then transformed into the
true scores. Take IVR in claim as an example (e.g., I found that the
change of the container shape does not change the density). RegEX
matched two conditions first and assigned this category a score of
0 because the two specific shapes were not mentioned. Then 
RegEX matched general expressions and found the target 
expression, change of the container shape, so 1 was assigned to the 
general expression category and matching stopped when the target 
content was found. In the analysis, this claim IVR was given a score 
of 0.8 points. 
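The highest-level-first matching described above can be sketched as follows. The three patterns are hypothetical stand-ins written for this illustration and are not the expressions used in the study; only the level ordering and the point values (1, 0.8, 0.5, 0) come from the rubric.

```python
import re

# Hypothetical patterns for the shape-density IVR sub-skill, ordered from the
# highest-scoring level to the lowest. Illustrative assumptions only.
IVR_LEVELS = [
    # Correct: two container shapes are named.
    (1.0, re.compile(
        r"narrow.*(square|wide)|square.*(narrow|wide)|wide.*(narrow|square)")),
    # General: a change of shape is mentioned without naming the shapes.
    (0.8, re.compile(r"(chang\w*|differ\w*).*shape|shape.*(chang\w*|differ\w*)")),
    # Partial: only one container shape is named.
    (0.5, re.compile(r"narrow|square|wide")),
]


def score_ivr(response):
    """Return the points of the first (highest) level whose pattern matches."""
    text = response.lower()
    for points, pattern in IVR_LEVELS:
        if pattern.search(text):
            return points  # matching stops once a level is found
    return 0.0  # incorrect: no level matched


example = "I found that the change of the container shape does not change the density"
example_score = score_ivr(example)
```

For the example claim, the two-condition pattern fails because no specific shapes are named, the general pattern then matches, and the response receives 0.8 points.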


In this study, we used an if-then algorithm to search for a particular 
word or phrase, as is done in AutoTutor [12]. Take the IV (e.g., 
shape of the container) in the claim as an example. First, RegEX 
“shap” was generated to match the word “shape.” Second, this 
RegEX was used to search a written claim. Third, if there was the 
word “shape” in the claim, then IV was present and scored as “1”. 
If no word “shape” existed in the claim, then IV was considered 
absent and scored as “0”. Moreover, before searching for the target
word, misspelt target words were corrected to avoid a decrease
in agreement [14]. If-then algorithms enhanced the performance 
especially for the more complex sub-skills, such as IVR, by 
matching the higher-level features first and then filtering down to 
the lower-level features. The modification of RegEx and algorithms 
typically took about 10 iterations for complicated sub-skills, such 
as theory, IVR, but fewer iterations for simple sub-skills, such as 
IV and DV. Each iteration took about 1-30 minutes, depending on 
the complexity of the sub-skills. 
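A minimal sketch of the if-then check for a binary sub-skill, using the “shap” expression mentioned above. The spelling-correction table and the helper names are hypothetical; the text states only that misspelt target words were corrected before matching, not how.

```python
import re

IV_PATTERN = re.compile(r"shap", re.IGNORECASE)  # matches "shape", "shapes", ...

# Hypothetical correction table; the actual correction method is not described.
CORRECTIONS = {"shaep": "shape", "desnity": "density"}


def correct_spelling(text):
    """Replace known misspellings word by word (illustrative assumption)."""
    return " ".join(CORRECTIONS.get(word.lower(), word) for word in text.split())


def score_iv(claim):
    """If the target word is present, score 1; otherwise score 0."""
    if IV_PATTERN.search(claim):
        return 1
    return 0


raw_claim = "The shaep of the container does not change the density"
iv_before = score_iv(raw_claim)
iv_after = score_iv(correct_spelling(raw_claim))
```

Without the correction step the misspelt claim is scored 0 for IV; after correction it is scored 1, which illustrates why correcting misspellings helps agreement with human raters.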


2.4 Statistical Analyses 

Linear regression analyses were conducted using M5-prime 
method to assess whether sub-skills were predictive of human 
scores of CER. We used two methods to validate the model. The 
first method was 10-fold cross-validation. The second method was 
to further validate the model with an independent data set in a 
different inquiry task, the shape-density activity. If the model yields good
performance with similar statistics as the cross-validation analyses, 
our confidence in model stability is increased and the model could 
be generalized to different Inq-ITS activities. We used the Pearson 
correlations as previous studies [14] did to evaluate automated 
scores and followed the same rules for describing their magnitude 
[2]: none (0.00-0.09), small (0.10-0.30), moderate (0.31-0.50),
and large (0.51-1.00).
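The evaluation step can be sketched as computing a Pearson correlation between human and automated scores and labeling its magnitude with the thresholds above. The score lists are hypothetical, and the handling of values that fall between the listed ranges (e.g., 0.095) is an assumption.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def magnitude(r):
    """Label |r| as none, small, moderate, or large."""
    size = abs(r)
    if size < 0.10:
        return "none"
    if size <= 0.30:
        return "small"
    if size <= 0.50:
        return "moderate"
    return "large"


# Hypothetical human and automated scores for five responses.
human_scores = [0, 1, 2, 3, 4]
automated_scores = [0, 1, 2, 4, 4]
r = pearson_r(human_scores, automated_scores)
label = magnitude(r)
```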


3. RESULTS 


3.1 Performance of Automated Scores 

A linear regression analysis for automated claim scoring with 10- 
fold cross-validation yielded a significant model in the type-density 
activity, r = .97, p < .001. The four sub-skills of claims were 
combined to account for 94% of the variance in the human claim 
scores, with regression coefficients (β) of 1.02, 1.04, 1.07, 0.86 (p
< .001) for IV, IVR, DV, and DVR, respectively. When this model
was validated in the shape-density data set, it was also significantly
correlated with human scores, r = .94, p < .001, which explained 
88% of the variance in the human scores. 


The same procedures were applied to the automated evidence 
scores. The cross-validation analysis showed a significant model, r 
= .97, p < .001, with three sub-skills accounting for 94% of the 
variance in the human evidence scores, with βs of 0.99, 0.87, and
0.90 (p < .001) for sufficiency, appropriateness IVR, and DVR, 
respectively. When this model was validated in the shape-density 
data set, the automated scores were also almost perfectly correlated 
with human scores, r = .97, p < .001, which explained 94% of the 
variance in the human evidence scores. 


Finally, the same analysis was conducted for automated reasoning 
scores. The cross-validation analysis indicated a significant model, 
r = .84, p < .001, with five sub-skills accounting for 71% of the
variance in the human reasoning scores, with βs of 0.21, 0.94, 0.85,
1.09, and 0.96 (p < .001) for theory, connection between data and 
claim, data of IV/IVR, DV, and DVR, respectively. When this 
model was validated in the shape-density data set, the automated 
scores were highly correlated with human reasoning scores, r = .85, 
p < .001, which explained 72% of the variance in the human 
reasoning scores. 


These findings imply that the automated CER scores could best 
capture human CER scores in the independent sets of data from 
both the same inquiry task/question and data from a different 
inquiry task/question (r = .84~.97, larger than threshold of .50) [2]. 
These findings imply that the automated methods with the sub- 
skills of CER are a promising approach to automatically score 
scientific explanations with respect to CER in science inquiry. This
automated method with regular expressions and if-then algorithms 
enables automated scoring to be generalized to different inquiry 
activities without additional training and testing of the model, and 
yields satisfactory performance. 
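The variance-explained figures reported in this section follow from squaring the corresponding correlations. The short check below uses only the r values reported above; the helper name is hypothetical.

```python
def variance_explained(r):
    """Proportion of variance explained, r squared, rounded to two decimals."""
    return round(r * r, 2)


# (cross-validation r, shape-density validation r) as reported in the text.
reported_r = {
    "claim": (0.97, 0.94),
    "evidence": (0.97, 0.97),
    "reasoning": (0.84, 0.85),
}

explained = {
    component: tuple(variance_explained(r) for r in pair)
    for component, pair in reported_r.items()
}
```

This reproduces the reported 94% and 88% for claim, 94% and 94% for evidence, and 71% and 72% for reasoning.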


3.2 Analyses of Errors 

Across three components of scientific explanations, automated 
claim and evidence scores almost perfectly predicted human claim 
and evidence scores when validated using independent sets of data 
from both the same inquiry task/question, as well as when using 
data from a different inquiry task/question. Reasoning showed a 
very good correlation between automated scores and human scores 
in both data sets, but this correlation was relatively low as 
compared to claim and evidence. This section, therefore, analyzes 
the errors of reasoning in the type-density data set. Table 2 displays 
the confusion matrix of automated rating and human rating for
reasoning, which shows where the human and automated scores
disagreed. Results showed a high discrepancy for scores 2 to 4. Specifically,
when the human score was 2, only 40% were given a score of 2 by
automated methods. Almost half of the remaining responses were
given 1 and the other half were given 3 points or more. Similarly,
when the human score was 3, only 44% were scored 3 by automated
methods. More than 30% were scored 2 and about another 30% were
scored 4 or 5. The same holds for the human score of 4: fewer than 40%
of responses were scored 4 by automated methods, while more than
half were scored 3 by automated methods.


Table 2. Confusion Matrix for Reasoning. 


| Human (row) / Automated (column) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | N |
[Cell values are illegible in the source.]
Note. 0-6 are the total reasoning scores rated by humans and 
automated methods based on the analytic rubrics. 
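The row-wise percentages discussed above can be derived from paired human and automated scores as in the following Python sketch. The scores in the example are made up for illustration and are not the study's data.

```python
from collections import Counter


def confusion_matrix(human, automated, levels):
    """Count (human, automated) score pairs; rows are human scores."""
    pairs = Counter(zip(human, automated))
    return [[pairs[(h, a)] for a in levels] for h in levels]


def row_agreement(matrix):
    """Share of each human-score row that the automated method matched exactly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


# Made-up scores for illustration only; these are not the study's data.
human = [2, 2, 2, 2, 2, 3, 3, 3, 3]
automated = [2, 2, 1, 1, 3, 3, 3, 2, 4]
levels = list(range(7))  # total reasoning scores 0-6
matrix = confusion_matrix(human, automated, levels)
agreement = row_agreement(matrix)
```

Here `agreement[2]` is the fraction of responses scored 2 by humans that also received 2 from the automated method, the kind of figure reported in the text for human scores 2 to 4.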


This relatively lower agreement may have been largely due to 
inaccuracy that was caused by simple regular expressions. As 
constructed reasoning responses involve more complex causal 
relationships and different levels of sub-skills, the simple regular 
expressions may not completely cover all alternative expressions in 
students’ responses. To examine which sub-skill showed high 
discrepancy between human rating and automated rating, we 
compared the agreement for the five sub-skills of reasoning 
between automated scores and human scores. Results showed very
high agreement for the first four sub-skills: 85% for theory, 85%
for connection, 92% for data IV/IVR, and 95% for data DV,
whereas the agreement for data DVR was only 46%. The
confusion-matrix analyses for data DVR revealed that the
automated scores used the binary score for this sub-skill (i.e., 
incorrect versus correct), whereas the humans rated DVR on the 
four levels mentioned in section 2.3. Binary scoring for DVR in the 
reasoning of the type-density activity was used to remain consistent 
with the scoring used in the shape-density activity. In the shape- 
density activity, there were no partially correct or general answers. 
Only correct answers (e.g., “density of liquid is the same” or
“density of liquid doesn’t change”) or incorrect answers were
considered. In the type-density activity, responses for DVR
included correct answers (e.g., “density of the liquid decreases from
water to oil”), general answers (e.g., “density of liquid changes due
to the change of liquid”), partial answers (e.g., “density of water is
largest”), and incorrect answers. Following the principle of least effort, we
did not change the algorithms from one activity to another to satisfy
the multiple categories of students’ responses accounted for by
humans. Thus, a large disagreement arose due to the binary scoring
used by the automated method versus the four-level scoring used
by humans.
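To illustrate how binary and four-level scoring can diverge on the same response, the following Python sketch classifies DVR responses with regular expressions. The patterns are hypothetical, modeled only on the example responses quoted above, and the assumption that binary scoring credits only fully correct answers is ours; the study's actual expressions and rules are not given in the text.

```python
import re

# Hypothetical patterns modeled on the example responses quoted above.
DVR_PATTERNS = [
    ("correct", re.compile(r"density .*(decreases|increases) from \w+ to \w+", re.I)),
    ("general", re.compile(r"density .*changes", re.I)),
    ("partial", re.compile(r"density of \w+ is (the )?(largest|smallest)", re.I)),
]


def dvr_four_level(response: str) -> str:
    """Return the first matching category, checked from most to least specific."""
    for label, pattern in DVR_PATTERNS:
        if pattern.search(response):
            return label
    return "incorrect"


def dvr_binary(response: str) -> int:
    """Binary scoring (assumed): only a fully correct answer earns credit."""
    return 1 if dvr_four_level(response) == "correct" else 0
```

Under these assumptions, a general or partial answer is indistinguishable from an incorrect one in the binary scheme, which is one way a gap between automated and human DVR ratings could arise.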


Even though the criteria that humans and automated methods used 
to score DVR in reasoning were different, automated scores still 
yielded good performance. The performance could be
improved if the automated method scores reasoning using the same 
criteria as humans. A future study may explore whether the 
consistency in DVR between automated and human rating would 
improve the performance of reasoning scores overall. 


4. DISCUSSION 


These findings demonstrate that using regular expressions to match 
key sub-skills of CER with if-then algorithms is a very promising 
approach to effective and efficient automated scoring of open 
response scientific explanations. This assertion can be confirmed 
based on two key factors. First, the automated methods showed 
very good correlations with human scores for CER in the 
independent sets of data with the 10-fold cross-validation analyses 
in the same inquiry task/question as well as in a different inquiry 
task/question. Previous studies on automated scoring of constructed 
response items showed that good correlations between automated 
scores and human scores ranged from .60 to .91 (e.g., [14]). In our 
study, automated scores for claim and evidence reached .97 in the 
cross-validation analyses in the same inquiry task/question. When 
transferred to a different inquiry task/question, results remain .97 
for evidence and .94 for claim. These results greatly exceed the 
current state of research on automated scoring of scientific 
explanations, as they are almost perfectly correlated with human 
scoring of claim and evidence scores. Even for reasoning using 
evidence, a more complex task, results were good as well, ranging 
from .84 to .85. One explanation for the slightly lower performance 
of automated reasoning scores is that the agreement between 
humans was lower relative to agreement for claim and evidence 
(.88 versus .99) due to the complexity of the reasoning task. 
Another explanation is that the regular expressions and algorithms 
applied across different tasks were the same. If we modify regular 
expressions to satisfy each activity, the performance of automated 
scoring for reasoning will likely increase. 


Second, the sub-skill features that were extracted by regular 
expressions along with if-then algorithms not only consistently 
predicted human scores, but also were simple to implement. A 
central factor in the success of this method was that experts were
able to generate accurate regular expressions to identify sub-skills
of explanations in science inquiry. More specifically, experts knew
how to identify the sub-skills of CER, how to develop a fine-grained
rubric to guide human and machine scoring, and how to
generate nearly complete regular expressions to capture as many
alternative expressions as possible in students’ responses. The use
of appropriate regular expressions was key to the success of our 
automated scores. Regular expressions were easier and quicker to 
generate for simple sub-skills such as IV, IVR, and DV for claim 
and data in evidence. For more complex sub-skills, such as DVR 
and theory, more time was needed to develop sets of alternative 
expressions. However, once the algorithms yield good 
performance, only a slight modification is needed for different 
activities. Compared to manual scoring, the time and effort that were
spent on the development of automated scoring were worthwhile.
Another key to the success of our automated scoring method was
the development of the fine-grained rubric. Our rubric was finalized
over many iterations. When we used more general rubrics, the
inter-rater reliabilities for reasoning were very low (r = .50). With the
fine-grained rubric, the reliabilities increased to .88. The high
agreement between human coders guaranteed the possibility of 
high agreement between human scores and machine scores. 


The success of automated scoring for open responses in science 
inquiry will greatly contribute to science education by making 
possible immediate individualized feedback on students’
explanations, as well as adaptive instruction and scaffolding. The 
implementation of automated scoring in computer-assisted learning 
and assessment systems will provide students with instant feedback 
on their constructed CER, which will allow students to immediately 
know their strengths and weaknesses with regard to scientific 
explanations. Teachers could then use the explicit feedback from 
automated scoring to adapt instruction based on what students need. 
In addition, the automated scoring of CER in science inquiry will 
advance the development of computer-assisted systems for inquiry, 
such as Inq-ITS. Inq-ITS has used automated scoring to implement 
immediate feedback and scaffolding for inquiry skills involved in 
“doing” science, such as formulating a question/hypothesis, 
collecting data, analyzing data, and warranting claims. Automated 
scoring could also be used to align students’ “doing” science skills 
with their science “writing” skills. The alignment of sub-skills 
involved in “doing” with “writing” during inquiry will allow for 
comparison of students’ conceptual knowledge with their ability to 
communicate such knowledge. Thus, this automated scoring 
approach truly advances science education by meeting the 
comprehensive assessment criteria that NGSS [21] demands:
science assessments that cover students’ understandings of
core ideas, their skills at conducting inquiry, and their skills
at effectively articulating what they know by generating a claim and
evidence for that claim, and articulating their reasoning linking
their claim to their evidence.


Even though the automated methods for scientific CER 
demonstrated good performance, there is one limitation that needs 
to be addressed in future studies. Namely, regular expressions for
reasoning may need to be adapted to each task/question to align
with the criteria used by humans. Doing so may improve the
accuracy.
Christopher Krauss
Fraunhofer FOKUS
Kaiserin-Augusta-Allee 31
10589 Berlin, Germany
christopher.krauss@fokus.fraunhofer.de

Agathe Merceron
Beuth Hochschule für Technik Berlin
Luxemburger Str. 10
13353 Berlin, Germany
agathe.merceron@beuth-hochschule.de


ABSTRACT


The emergence of Massive Open Online Courses (MOOCs) has
enabled new research to analyze typical behaviors of learners. In
this paper, we investigate whether this research is generalizable to
other courses that are backed by a learning management system
(LMS) as MOOCs are. Building on methods developed by others,
we characterize individual learning behaviors in different ways
taking into account specificities of the LMS we use. We then
apply clustering techniques to uncover typical behaviors in two
university courses. One course, JavaFX, teaching about the
software programming framework, has been offered as a
supplementary online course to students enrolled in an online
degree. Enrolling in this course was voluntary and students did
not earn any credit towards their degree; in this sense, the JavaFX
course bears similarities to a MOOC though it is neither massive
nor open to everybody. The other course is a classical face-to-face
course on Advanced Web Technologies (AWT) backed by our
LMS; students earn a degree when they pass the final exam. It
turns out that the different characterizations of individual learning
behaviors are consistent for the JavaFX course and uncover
typical behaviors reminiscent of those found by others in
MOOCs, while they aren’t as applicable to the AWT course.
However, typical behaviors found in the AWT course give
insights on styles that lead to better marks.


Keywords
MOOC, Typical behaviors, X-means clustering


1. INTRODUCTION


The emergence of MOOCs with the general observation of their
low completion rates has triggered new research to analyze
typical behaviors of learners in MOOCs and brought forth
evidence for various engagement/disengagement patterns such as
completing, auditing, disengaging and sampling, as proposed by
Kizilcec et al. [1]. In their paper, Kidzinski et al. [2] write that
categorization schemes as found in [1] and others “remain robust
in terms of generalizability within the MOOC’s context, but they
are hard to generalize outside of it”. In this paper, we tackle that
claim. We investigate whether this research can offer interesting
insights to other courses that are backed by a learning
management system (LMS), even though the analyzed courses are not
necessarily massive or open, and not even completely online.


We investigate two courses presented with the Learning
Companion App (LCA) [3]. The LCA is an LMS designed primarily
for vocational training. Compared to other LMSs
common in higher education like Moodle, the Coursera platform
or edX, LCA has two salient features to encourage self-reflection
and support efficient learning. The first feature concerns the
learning objectives that need to be associated with each learning
object (LO) in the course. All the learning objectives of one
chapter are displayed for rating at the beginning and at the end of
any chapter. A learner can assess how well s/he knows each
learning objective. These self-assessments encourage learners to
reflect on their previous knowledge, and on how much they know
after learning the chapter. The second feature is a
recommendation engine that suggests to learners what to learn next
[4]. Learners are free to consult these recommendations.
Comprehensive user interactions are stored as xAPI statements 
[5]. The LCA is independent of any topic and any institution and, 
therefore, can be used in other contexts and for other courses. 


The two courses considered in this study, JavaFX and Advanced
Web Technologies (AWT), took place in the context of
higher education. The JavaFX course was offered as an
optional online course to students enrolled in an online degree in
computer science. These students learned to program graphical
interfaces with the older framework Swing instead of the newer
framework JavaFX. By taking part in this course, students did not
earn any marks for their studies; they only increased their
knowledge of the topic. The AWT course targeted master's
computer science students. It was a classical face-to-face course
taught with the support of the LCA in winter semester 2016/17.
When enrolled in this course, students usually had the aim of
passing the final exam and earning the corresponding credits for their
master's degree.


In this study, we follow and adapt the approach of [1, 6] and 
explore several different ways of qualifying individual learning 
behaviors as similar. It turns out that for the JavaFX course, these 
different ways are consistent and uncover two to three typical 
learning behaviors reminiscent of those identified by Kizilcec et al.
[1]. For the AWT course, only one way of qualifying behaviors
turns out to be sensible. The uncovered typical learning behaviors
from both courses match those exhibited in [1, 6] and give insight
into styles that lead to better marks.


This paper is organized as follows. Related works are discussed in 
Section 2. Specificities of courses in our LMS, the Learning 
Companion App, are presented in Section 3. In Section 4,
different ways of characterizing individual learning behaviors are
explained and typical learning behaviors found in both courses are
presented and discussed. Conclusions and future work are given
in Section 5.


2. RELATED WORK 


Kizilcec et al. [1] investigated learners’ engagement in courses 
from Coursera which consist of weekly videos and assessments, 
and proposed four typical engagement / disengagement patterns 
that they call 

• Completing: “learners who completed the majority of the
assessments offered in the class”,

• Auditing: “learners who did assessments infrequently if at
all and engaged instead by watching video lectures”,

• Disengaging: “learners who did assessments at the
beginning of the course but then have a marked decrease in
engagement (their engagement patterns look like
Completing at the beginning of the course but then the
student either disappears from the course entirely or
sparsely watches video lectures)” and

• Sampling: “learners who watched video lectures for only
one or two assessment periods”.

These categories have been identified in three courses; however, 
their proportions differ in each course. To discover these 
categories, they have first characterized a student by a tuple 
giving her/his status each week: “on track [T] (did the assessment 
on time), behind [B] (turned in the assessment late), auditing [A] 
(didn't do the assessment but engaged by watching a video or 
doing a quiz), or out [O] (didn't participate in the course in that 
week)” [1]. 
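The weekly status tuple described above can be sketched in Python as follows. The function signature, the boolean inputs, and the sample learner are hypothetical and serve only to illustrate the four labels quoted from [1].

```python
def weekly_status(submitted_on_time: bool, submitted_late: bool,
                  watched_or_quizzed: bool) -> str:
    """Label one learner-week as T, B, A or O following the description in [1]."""
    if submitted_on_time:
        return "T"  # on track: did the assessment on time
    if submitted_late:
        return "B"  # behind: turned in the assessment late
    if watched_or_quizzed:
        return "A"  # auditing: no assessment, but watched a video or did a quiz
    return "O"      # out: no participation that week


# Hypothetical learner over four weeks: (on time, late, video/quiz activity).
weeks = [(True, False, True), (False, True, True),
         (False, False, True), (False, False, False)]
profile = tuple(weekly_status(*week) for week in weeks)
```

The resulting tuple, one label per week, is the kind of representation that is then clustered to find typical engagement patterns.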


In an attempt to replicate the work of [1], Ferguson and Clow [6] 
suggest that the methodology used to uncover typical learning 
behaviors in a MOOC’s context does not necessarily generalize to 
another MOOC adopting different elements of pedagogy and 
learning design. Since the courses analyzed in [6] follow a social 
constructivist pedagogy, Ferguson and Clow adapt the 
methodology of [1]. They also consider participation in
discussions and end up with 10 values to characterize the weekly
status of a student, instead of the four values T, B, A and O
introduced in [1]. They have identified the following typical
learning behaviors: Samplers (“Learners in this cluster visited, but
only briefly”, similar to sampling above), Strong Starters (“these
learners completed the first assessment of the course, but then
dropped out”), Returners (“these learners completed the
assessment in the first week, returned to do so again in the second
week, and then dropped out”), Mid-way Dropouts (“these learners
completed three or four assessments, but dropped out about
half-way through the course”), Nearly There (“these learners
consistently completed assessments, but then dropped out just
before the end of the course”), Late Completers (“this cluster
includes learners who completed the final assessment, and
submitted most of the other, but were either late or omitted
some”) and Keen Completers (“this cluster consists of learners
who completed the course diligently, engaging actively
throughout”, similar to completing above). The two approaches in
[1, 6] share the same principle of selecting a priori features that
are sensible to describe a student's individual engagement, and
then use K-means clustering to discover typical learning
behaviors.


Gelman et al. [7] adopt a different, more bottom-up approach to 
discover typical behaviors: they use a set of 21 features that they 
can extract week by week from the log data and adapt non- 
negative matrix factorization to obtain weekly behaviors that are 
supported by a combination of those features. This approach is 
attractive because it does not need a careful selection of features 
to characterize the behavior of a student; instead, the algorithm 
selects and combines features from the set it receives as input. A 
difficulty lies in the interpretation and the practical use of the 
discovered behaviors. While an auditing behavior “learners who 
did assessments infrequently if at all and engaged instead by 
watching video lectures” [1] is easy to derive, it is less clear what 
a weekly deep behavior “the associated students must have spent 
a long time on a single resource” [7] means for educators. 


In this paper, we adopt the first approach and adapt it to our 
context, taking inspiration from the work in [6]. 


3. COURSES IN THE LCA 


The Learning Companion App (LCA) is a whole infrastructure
that can be thought of as an LMS equipped with a repository for
learning objects, a recommendation engine and a learning
analytics module. It is at the same time an App in responsive
design which is the entry point for students to access courses, 
learning objects (LOs) and lecture schedule as well as to get 
recommendations for the next best contents to be learnt; 
furthermore, it triggers the tracking of all relevant user 
interactions [8]. 


In LCA, each learning object at the lowest level is paired with its 
metadata that includes at least one learning objective, a typical 
learning time and its prerequisites. A learning object can be a 
piece of text, a video, an exercise (similar to an exercise of an 
assessment in a MOOC), an animation, even a downloadable 
document and so on. Learning objects are bundled into learning 
units and a course is essentially a sequence of learning units. The 
learning objectives of a learning unit are the union of the learning 
objectives of its learning objects. A learning unit is rendered in 
the LCA as an “accordion” GUI element with a specific 
sequential structure. The top item of the accordion view that can 
be opened is the list of the learning objectives of that unit. 
Learners can rate each learning objective and so indicate how 
much they know already on that topic, from 1 “know nothing” to 
5 “expert”. We call this list self-assessments. This item is 
followed by the sequence of the LOs of that unit. The user can 
interact with the learning objects by clicking on the title in the 
accordion view whereupon the requested content is presented. 
The user is only shown one learning object at any time so that 
s/he can concentrate fully on this content. Following the sequence 
of LOs, the next item in the accordion view is again the list of 
learning objectives. By rating them, a student can reflect on how 
much s/he knows after learning the unit. The next item in the 
accordion view allows students to provide feedback on the typical 
learning time for that unit (from 1, “way too little time” to 5, 
“way too much”) and give comments. The last item in the
accordion view opens a discussion thread on that unit. Apart from 
its sequence of learning units, a course contains a schedule which 
specifies dates for the start and end of the course, as well as when 
each learning unit should be learned. 
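The course structure described above can be pictured with a small data model. The Python sketch below is only an illustration: the class and field names are our own assumptions and do not reflect the LCA's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class LearningObject:
    title: str
    objectives: set          # at least one learning objective
    typical_minutes: int     # typical learning time
    prerequisites: list = field(default_factory=list)


@dataclass
class LearningUnit:
    title: str
    objects: list            # ordered sequence of LearningObject instances

    @property
    def objectives(self) -> set:
        """Union of the learning objectives of the unit's learning objects."""
        return set().union(*(lo.objectives for lo in self.objects))


# Hypothetical content, for illustration only.
unit = LearningUnit("Unit 1", [
    LearningObject("Intro text", {"objective A"}, 10),
    LearningObject("Exercise 1", {"objective A", "objective B"}, 5),
])
```

Deriving the unit's objectives from its learning objects, rather than storing them separately, mirrors the statement that a unit's objectives are the union of those of its LOs.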
All users’ interactions are stored using the xAPI specification [5]
in the open-source learning record store called Learning Locker¹.
The accordion view allows inferring how long any item of the 
view is opened. Typical mined data include number of clicks on 
all items of the accordion view (self-assessments, LOs, feedback, 
discussion threads), time an item is open, answers and 
performance in exercises, ratings of pre- and post-study self- 
assessments, feedback, messages of discussion threads. Note that 
a student can access any LO directly by clicking on the 
recommendations. For this study, this does not change the kind of 
interactions that are stored. 
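As a rough sketch of how click counts might be mined from such records, the Python example below tallies statements per learner and item. It uses only the actor, verb, and object fields of an xAPI statement in simplified form; the verb and object identifiers are invented, and the way the LCA actually encodes its interactions is not described in the text.

```python
from collections import Counter

# Simplified statements using the actor / verb / object fields of xAPI.
# All identifiers are invented for illustration.
OPENED = "http://example.org/verbs/opened"
statements = [
    {"actor": {"name": "student-1"}, "verb": {"id": OPENED},
     "object": {"id": "http://example.org/lo/intro-text"}},
    {"actor": {"name": "student-1"}, "verb": {"id": OPENED},
     "object": {"id": "http://example.org/lo/intro-text"}},
    {"actor": {"name": "student-2"}, "verb": {"id": OPENED},
     "object": {"id": "http://example.org/lo/exercise-1"}},
]


def click_counts(statements, verb_id=OPENED):
    """Count statements with the given verb per (learner, object) pair."""
    counts = Counter()
    for statement in statements:
        if statement["verb"]["id"] == verb_id:
            key = (statement["actor"]["name"], statement["object"]["id"])
            counts[key] += 1
    return counts
```

Calling `click_counts(statements)` yields one count per learner and item, which is the raw material for the click-based characterization of students.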


The two courses discussed in this paper make all the learning 
material available from the start of the course to encourage self- 
pacing and self-organization of students. Furthermore, the time 
schedule of the courses is indicative only, in the sense that there is 
no penalty if someone does not follow the schedule. Finally, in 
both of them, students did not post in the discussion threads; they 
only wrote (few) comments in the feedback area. However, the 
two courses differ significantly in their didactical organization 
and contents. 


The JavaFX online course, available for a period of 11 weeks, 
offers an introduction to the JavaFX framework for the
development of platform-independent Java applications and
targets bachelor computer science students. This course is 
suggested as an optional online course to students enrolled in an 
online computer science bachelor course. By taking part in this 
course, students do not earn any mark for their studies, only 
knowledge for themselves. 


It comprises three learning units. Each learning unit has about five 
learning objectives and contains about fifteen to thirty LOs (units 
are not of equal length). About half of the LOs are texts to explain 
concepts and example programs, and half are exercises 
(single/multiple choice, cloze tests and so on). The last LO before 
the self-assessment of the learning objectives is a comprehensive 
programming task; students can send their program by email and
obtain a manually commented evaluation. Based on the
educational discussion on MOOCs, Daniel [9] pointed out that
“students seek not merely access, but access to success”. 
However, success can be different for each student. Driven by this 
consideration, a specific LO has been added to this course 
allowing each student to rate her/his motivation on a scale from 0 
(do nothing) to 100 (engage thoroughly with everything offered). 
51 students enrolled in this course; however, there were 23
no-shows (defined in [10] as “people register but never login to the
course while it is active”). Only the remaining 28 students are
considered for the analysis in this paper. The 28 users generated 
3624 xAPI statements in total during the course. 


Advanced Web Technologies (AWT) targets master's computer
science students. In 12 weekly in-person lectures, technical experts
teach diverse topics that are of interest for future web
developers, from web technology basics such as HTML, through
media delivery and content protection, to personalization through
recommender systems and the Internet of Things. The lectures
are mostly held with slides created in PowerPoint showing
definitions, specifications, and source code, animations for
concepts and videos for practical examples. The approximately 1000
presented slides are converted to digital learning objects, one slide
being a single LO, and grouped into 105 learning units for the
representation in the LCA, with videos, animations and
additional multiple-choice questions at the end of the learning
units. Moreover, as some students still want to learn with a
printed version of the slides, the last LO of the accordion view is
a downloadable PDF file containing all the slides of the unit.


¹ Learning Locker. See: https://learninglocker.net/


142 students enrolled for AWT in winter semester 2016/17; 
however, there were 43 no-shows. Only the remaining 99 students 
are considered for the analysis in this paper. Especially in the first
weeks before the official registration deadline, students frequently
change their mind about participating in specific courses,
which might explain the high share of no-shows. At the
end of the course, students can earn credits by completing a
one-hour exam consisting of 50 multiple choice questions and 5 bonus
questions. Exactly 75 students completed the final exam (including
two who did not use the LCA) and the average mark was 1.90
(only one student failed the exam; note that the best mark is 1.0
and the worst possible mark is 5.0). The 99 users generated 92825
xAPI statements in total during the course. 


In contrast to the courses offered by [1], [6] and the JavaFX 
course, the primary goal for students of AWT is to pass the final 
exam. AWT does not offer any intermediate assessment. Students 
access online material, first and foremost, for the wrap-up of face- 
to-face lectures and for exam preparation. 


4. METHODOLOGY AND RESULTS 


In our context, there are multiple sensible ways to compare 
students in their learning behaviors. Because the time schedule is
purely indicative for students and all the materials are available 
from the start of the course, we compared students on how they 
have interacted with the course independently of time. In this 
paper, we investigate four such ways. 


Clicks only: In this way, we consider only click counts per 
learning object. A student is represented by a vector that 
represents how many times s/he has clicked each element of the 
whole course. In this way, two students are similar if they access 
almost the same learning objectives, learning objects, feedback, 
and motivation (for JavaFX only as AWT does not have this 
feature) a similar number of times. 
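A minimal Python sketch of this representation follows. The element identifiers and counts are hypothetical; Euclidean distance is used for the comparison because it is the distance the clustering in this study relies on.

```python
import math


def click_vector(counts: dict, elements: list) -> list:
    """One entry per course element, in a fixed order; unvisited elements get 0."""
    return [counts.get(element, 0) for element in elements]


# Hypothetical element identifiers and click counts for two students.
elements = ["objectives_unit1", "lo_1", "lo_2", "feedback_unit1"]
student_a = click_vector({"lo_1": 3, "lo_2": 1}, elements)
student_b = click_vector({"objectives_unit1": 1, "lo_1": 2, "lo_2": 1}, elements)

# A small Euclidean distance means the two students clicked similarly.
distance = math.dist(student_a, student_b)
```

Keeping the element order fixed for all students is what makes the vectors comparable position by position.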


Elapsed time: In this way, elapsed time spent on each learning
object replaces click count. A student is represented by a vector
that has the size of all learning objects of the course. The learning
objectives, feedback, and motivation are not considered because
the time spent is not tracked individually for these features. Two
students are similar if they spend a similar overall time on the 
same learning objects (texts, videos, exercises, etc.). The overall 
time is the sum of the elapsed times in each visit. 
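The overall time per learning object can be accumulated from individual visits as in this Python sketch. The input format, a list of (learning object, seconds open) pairs for one student, is an assumption made for the example.

```python
from collections import defaultdict


def total_time_per_object(visits):
    """Sum the elapsed seconds of all visits to each learning object.

    visits: iterable of (learning_object, seconds_open) pairs for one student.
    """
    totals = defaultdict(float)
    for learning_object, seconds in visits:
        totals[learning_object] += seconds
    return dict(totals)


# Hypothetical visits of one student.
visits = [("lo_1", 30.0), ("lo_2", 12.5), ("lo_1", 45.0)]
totals = total_time_per_object(visits)
```

The totals can then be placed into a fixed-order vector, one entry per learning object, in the same way as the click counts.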


Assessment scores: In this way, we consider performance on all 
assessments, including programming tasks of the JavaFX course. 
A student is represented by a vector that has the size of all 
assessments; values are ratings given in all self-assessments, 
marks earned in all exercises, rating given in feedback and 
motivation (AWT does not have the motivation feature). The final 
exam for AWT is not considered. Two students are similar if they 
achieved similar scores on all assessments. 


Elapsed time and assessment scores: In this way, we consider a
combination of the latter two: elapsed time on what students look
at (texts, videos and so on) and scores on what students answer
(self-assessments, exercises and so on). Two students are similar
if they spend a similar overall time on similar learning objects
such as texts, videos, slides and so on (that are not exercises) and
achieve similar scores on all assessments.


Figure 1: Plot of the centroids of the 2 clusters returned by X-means in the JavaFX course. The x-axis represents all the elements of the
course (learning objectives, learning objects etc.); the y-axis gives the average normalized number of clicks per element.


We used RapidMiner² and applied the X-means clustering
algorithm with Euclidean distance. X-means finds an optimal
number of clusters and is known to find fewer clusters than
K-means [11]. Due to the size of the vector representing each
student (in the way Clicks only a student is represented by a
vector with 143 values in the JavaFX course) and the small data
sets, clustering is challenging. Furthermore, in RapidMiner,
X-means is implemented in such a way that it will always find a
minimum of two clusters, even if the data is uniform. To validate
that the data does cluster naturally, we also applied K-means and
checked for the drop in the curve plotting K against the sum of
squared errors (which corresponds to the average within distance
of RapidMiner). Values of click counts and elapsed time have
been normalized. Assessment values like marks in exercises or
self-assessments are already stored as scaled values.
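The elbow check described above can be sketched in plain Python. This is not the RapidMiner implementation: it is a basic K-means with Euclidean distance, it starts from evenly spaced points instead of a random initialization so that the example is deterministic, and the data are toy vectors rather than the study's.

```python
import math


def assign(points, centroids):
    """Index of the nearest centroid (Euclidean distance) for each point."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, iterations=50):
    """Plain K-means; starts from evenly spaced points to stay deterministic."""
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign(points, centroids)


def sse(points, centroids, labels):
    """Sum of squared Euclidean distances to the assigned centroids."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Toy, already-normalized vectors (not the study's data).
points = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [1.0, 0.9], [0.9, 1.0], [1.0, 1.0]]
curve = {k: sse(points, *kmeans(points, k)) for k in (1, 2, 3)}
```

On this toy data the error drops sharply from K = 1 to K = 2 and only slightly afterwards; such a bend in the curve is the signal used to judge whether the data cluster naturally.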


4.1 JavaFX 


X-means returns exactly the same two clusters for three 
approaches: Clicks only, Elapsed time and Elapsed time and 
assessment scores (the results of the fourth approach are 
described later on). Figure 1 shows a visualization of these two
clusters for the Clicks only way; it lists all elements of the course 
on the X-axis and shows the corresponding normalized number of 
clicks of the clusters’ centers on each element. The first cluster 
(cluster 1, the blue line in the upper diagram of Figure 1) consists 
of 5 students who engage with many elements such as self- 
assessments, learning objects and also interact with the 
automatically generated features like feedback. When sorting the 
students according to the number of distinct elements they have 
accessed in the course, these 5 students come on top. On average, 
students in this cluster have accessed 72 distinct course elements. 
If the elements are restricted to the exercises only, as they best 
match assessments in MOOCs, these 5 students remain on top: 
except for one, who performed 15 exercises, they have performed 
25 to 30 exercises out of 34. The other 23 students in the course, 
represented in the second cluster (cluster 2, red line and bottom 
diagram of Figure 1) accessed the learning objects less often and 
did very few self-assessments. On average, students in this group 
have accessed 10 distinct elements of the course and solved 
exercises infrequently, if at all: four times or fewer. Transferred to 
the categories in [1], we find that these two patterns of 
engagement are reminiscent of completing and auditing but 
without any reference to time. In [1] it is clear that completing 
students have solved assessments week by week because 
assessments are available in the course week by week only. In our 
course, completing students could have solved exercises regularly, 
or all during a few weeks only, depending on their own 
time-management. 


Footnote: RapidMiner. See: https://rapidminer.com/ 


The K-means algorithm finds an optimal set of 4 clusters; see the 
upper elbow-curve of Figure 2 with the drop when k is 4. One 
cluster matches exactly cluster 2 found with X-means, while the 
cluster with 5 students is split into 3 clusters. This finding shows 
that data naturally clusters; however, the two clusters returned by 
X-means are more interpretable. 


X-means returns three clusters when using Assessment scores. 
Cluster 1 with the pattern completing is also found here. Cluster 2 
above is now split into two clusters: one with 18 students and 
cluster 3 with 5 students. What distinguishes these 5 students 
from the remaining 18 students is that they answered self- 
assessments and engaged with exercises mostly from the first unit 
of the course, hardly from the following units. They correspond to 
disengaging in [1], although the beginning of the course does not refer 
to time but to the sequence of the units that are displayed in the 
LMS. The K-means algorithm finds an optimal set of 5 clusters; as 
before, the completing cluster is split into 3 clusters. 


At first, it may be surprising that the three characterizations 
(Clicks only, Elapsed time, and Elapsed time and assessment scores) 
give exactly the same clusters: completing and auditing. With 
some consideration, this result is understandable: what most 
distinguishes two learners is whether one has accessed an 
element and the other has not. A completing student has accessed 
many more elements of the course than an auditing student; these 
two behaviors are discovered by X-means. The characterization 
Assessment scores reduces the number of features used to perform 
clustering (interactions with LOs such as text or videos are 
omitted) and allows for distinguishing a sub-category in the 
auditing group: disengaging. Those learners complete 
activities primarily in the first unit of the course and then stop. 
They have hardly engaged with the course in the following units, 
which makes them similar to auditing students in the three other 
ways: they have engaged infrequently with exercises and have 
looked at few learning objects. 


Figure 2: Plot of K against average within distance for the Clicks only 
scenario, for JavaFX (above) and AWT (below). 


Figure 3: Plot of the centroids of the 3 clusters returned by X-means in the AWT course. The x-axis lists all the assessments of the course 
(self-assessments left part, exercises and feedback on time right side); the y-axis gives the scaled score of the center per element. 


4.2 AWT 


The first three approaches (Clicks only, Elapsed time and Elapsed 
time and assessment scores) lead to no meaningful results for the 
AWT course. On the one hand, K-means does not show a natural 
clustering of the data for any of these ways: plotting K against the 
average within distance does not show any drop, as the curve for 
the AWT course in Figure 2 bottom shows. On the other hand, 
these three ways are not really adequate to describe the 
engagement of an individual student due to the digital content of 
this course: at the end of each unit, there is a .pdf file containing 
all the slides of this unit. A student might download only the .pdf 
file of each unit and look at it as much as s/he wants, another 
student might access all the slides online multiple times. From the 
interactions that are stored and evaluated, these two students look 
very different, yet their learning behaviors are similar. At the 
beginning of the course, 66 students have requested PDFs, and 
this number decreased to 19 by the end of the course. 
One third of all students have requested all PDFs. 


In contrast, for the Assessment scores approach, X-means 
generated three definite clusters. Figure 3 shows a visualization of 
these three clusters; it lists all assessments of the course on the X- 
axis and shows the corresponding score of the clusters’ centers on 
each element. Two parts are clearly distinguishable: a rather flat 
left part and a right part where the blue (top) and the red (bottom) 
lines show spikes. The rather flat left part corresponds to the self- 
assessments; generally, not many students rated themselves. The 
right part corresponds to the exercises and student feedback. 
Cluster 1 contains 9 students, including the one who did not pass 
the final exam (the upper diagram with the blue line of Figure 3). 
Students in this cluster provided self-assessments in the first three 
units, and worked out exercises but did not achieve good scores. 
They are reminiscent of the Strong Starters and Returners proposed in [6] 
when this vocabulary is adapted to the sequential order of the 
units instead of the first weeks of the course. To some extent, they 
also exhibit some kind of completing pattern in terms of exercises, 
because they completed almost half of them: on average 22 from 
a total of 48. Their average mark in the final exam is 2.03, which 
is slightly worse than the general average of 1.90. The biggest 
cluster contains 64 students (cluster 2, the diagram in the middle 
with the green line of Figure 3) and is similar to the pattern 
auditing because they did exercises infrequently if at all: on 
average 1 out of 48. However, they did access .pdf files. All 
learners who did not participate in the final exam fall into this 
cluster. The average mark of the students in this cluster who 
participated in the final exam is 2.23 (no-shows are neglected), 
which is below the general average. The last cluster contains 26 
students and shows a completing pattern (cluster 3, the bottom 
chart with the red line of Figure 3). If one sorts the students 
according to the number of distinct exercises they have solved in 
the course, 25 of these 26 students are the top 25. They have 
worked on nearly all the exercises, on average 42 out of 49, and 
completed almost all of them correctly. The final exam mark in 
this completing cluster reaches 1.50 on average, a better mark 
than the overall average of 1.90. The last two clusters are 
interesting: a completing student does well in the final exam, 
while an auditing student does worse in the final exam or even 
does not attend it. However, as opposed to [1], these patterns do 
not tell us anything about when students accessed the assessments in the 
time schedule. 


The K-means algorithm finds an optimal set of 4 clusters. It finds 
exactly the same big cluster of 64 students and finds almost the 
same first cluster as X-means does. However, it splits the last 
cluster to isolate three students. Students in both groups still 
solved on average 42 exercises, but they differ in how they 
engaged with self-assessments. The small group of 3 students 
rated 74 self-assessments on average, and the other students 
rated only 3 self-assessments on average, in the first units of the course. 


5. DISCUSSIONS AND FUTURE WORK 


Considering the particularities of our courses, we have defined 
four meaningful ways of characterizing an individual learning 
behavior. We have used X-means clustering to extract typical 
learning behaviors from two distinct university courses, an 
optional online JavaFX course and a compulsory face-to-face 
course about Advanced Web Technologies. Because of the small 
data sets, particularly for the JavaFX course, clustering is 
challenging. We found that students do not act at random. In the 
JavaFX course, we could derive evidence of behaviors that are reminiscent 
of patterns found in [1]: completing, auditing, and disengaging. 
Only the Assessment scores way delivers reliable clusters for 
AWT. Of the three clusters uncovered by X-means, two are 
particularly interesting. All students who were ultimately not 




participating in the final exam were located in the auditing 
cluster. Other students in that cluster, who participated in the final 
exam, tend to do less well than average. Students of the 
completing cluster tend to pass the exam with very good marks. 
Note that completing, auditing, and disengaging in this paper are 
similar to [1] in terms of which kind of learning material has been 
accessed frequently or not; as opposed to [1], our approach does 
not provide information on when in the time schedule the material 
has been accessed. 


The present results suggest that typical behaviors found in 
MOOCs can be transferred to other courses - with care. This 
situation bears similarities with predicting students at risk of 
deserting a course. Numerous articles show that models with good 
predictive power can be built to predict drop-off and also the 
performance of students in a course. These articles show also that 
there is no set of features and no classifier that works best in all 
contexts: no one size fits all. On the contrary, the set of features 
and classifiers needs to be adjusted to the data and setting at hand 
to achieve good predictive power. The work of [2] also supports 
this view for MOOCs. Our results suggest that the situation is the 
same for typical behaviors. We adjusted methods of others to our 
context and were able to extract interesting and interpretable 
typical behaviors from relatively small data sets. This work 
considers rather simple features like clicks and elapsed time. 
Future work should focus on a more sophisticated feature 
extraction. 


In our setting, there is a time schedule, even if it is indicative 
only. It could make sense to devise ways of characterizing an 
individual behavior taking this time schedule into account. The 
method of [1] needs careful adaptation because a learner might be 
on track or behind and might also be early. Work along these lines 
has already begun. Preliminary work shows that four of the five 
students of the completing cluster of the JavaFX course began 
only after three weeks to engage with the course, while the 
majority of the completing cluster of the AWT course engaged 
with the course regularly each week. Another future work is to 
reflect on implications for the recommendation engine and the 
learning analytics module. Should the recommendation engine be 
adjusted to different typical behaviors for example? We plan to 
integrate these findings in the overall behavioral feedback shown 
to students. 


ABSTRACT 


We investigate the use of consumer-grade eye tracking to 
automatically detect Mind Wandering (MW) during learning from 
a recorded lecture, a key component of many Massive Open 
Online Courses (MOOCs). We considered two feature sets: 
stimulus-independent global gaze features (e.g., number of 
fixations, fixation duration), and stimulus-dependent local 
features. We trained Bayesian networks using the aforementioned 
features and students’ self-reports of MW and validated them in a 
manner that generalized to new students. Our results indicated 
that models built with global features (F1 MW = 0.47) 
outperformed those using local features (F1 MW = 0.34) and a 
chance-level model (F1 MW = 0.30). We discuss our results in the 
context of MOOC development as well as integrating MW 
detection into attention-aware MOOCs. 


Keywords 

eye-gaze, Massive Open Online Courses, lecture viewing, 
intelligent tutoring systems, mind wandering, attention-aware 
learning 


1. INTRODUCTION 


Imagine you are giving a lecture on population diversity. Most of 
your audience is engaged; however, one or more of your students 
are displaying signs of inattentiveness (e.g., dozing off, staring 
blankly). You may call on such a student in the hope of bringing 
their attention back to the lecture. You may even suggest a short 
break if too many students appear to be inattentive. This 
adaptation to your lecture was only possible because you had the 
ability to continually monitor your students’ levels of attentional 
focus and to alter your instruction in real-time. 


Now imagine you are teaching a Massive Open Online Course 
(MOOC). Your students are no longer in the same room as you 
and in many cases are not viewing the lecture at the same time 
you are delivering it. You no longer have the ability to monitor 
students’ attentional focus and adapt to signs of inattentiveness. 


Despite the challenges for educators, MOOCs are an increasingly 
popular method amongst students for e-learning and distance 
learning [16]. They have also been popular in traditional learning 
environments as alternate ways for delivering material [27]. 
MOOCs are often distributed world-wide to a variety of students 


across platforms with no limitations on individual participation. 
While there are some advantages to MOOCs with respect to 
promoting access, little is known with regard to how they address 
individual learners’ needs. MOOCs have long had issues with 
extremely high dropout rates [1, 37], far greater than those in 
‘traditional’ classroom environments. Though there has been 
work tying students’ experiences with MOOCs to the dropout rate 
[37], there has been little exploration as to individual user 
experiences and trends that lead to retention problems [1, 17]. 


As a step towards better understanding student engagement within 
MOOCs, we focus on one form of disengagement called mind 
wandering (MW). MW is defined as an attentional shift from task- 
related processing towards internal task-unrelated thoughts [31]. 
In the context of learning, both lab and field studies have 
consistently reported MW rates in the 20%-50% range [21, 26, 
34]; work looking specifically at recorded lectures showed the 
MW rates to be 20-45% [26, 34]. Additionally, a recent meta- 
analysis revealed a negative correlation between MW and 
performance across a variety of tasks [23]. MW negatively 
impacts a learner’s ability to attend to external events [30], to 
encode information into memory [29], and to comprehend 
learning materials [28, 30]. As a result, MW is generally found to 
have a negative impact on learning outcomes. 


Attempts to assuage the cost of MW rely on knowing if MW has 
occurred. However, detecting MW is no easy task. Although MW 
is related to other forms of disengagement, such as boredom, 
behavioral disengagement, and off-task behaviors [2, 3, 36], it is 
inherently distinct because it involves internal thoughts rather 
than overt expressive behaviors. This raises two challenges. First, 
while other disengaged behaviors often involve detectable 
behavioral markers (e.g., yawns signaling boredom), mind 
wandering is an internal state that can appear similar to being on- 
task [31]. Second, the onset and duration of MW cannot be 
precisely measured because MW can occur outside of conscious 
awareness [32]. 


Despite these challenges, there has been some progress toward 
automatic detection of mind wandering (discussed as related 
works in Section 1.1). However, almost all of the current MW 
detectors focus on reading. In contrast, we consider MW detection 
while students view MOOC-like lectures, building and validating 
the first gaze-based MW detector during video lecture viewing. 
We focus on video lectures because they are a core component of 
many courses and are vital to MOOCs. As MOOCs and lecture 
capture systems become more popular, we envision a variety of 
challenges with respect to keeping students engaged when content 
delivery occurs outside of the classroom with the instructor not 
even present. In this work, we harness the use of a computer in 
content delivery to take a step towards attention-aware 
MOOCs. 


1.1 Related Work 


In an early study attempting to detect MW in the context of 
learning [10], students were asked to read aloud a paragraph about 
biology, followed by either self-explaining or paraphrasing. 
Students self-reported how frequently they zoned out on a scale 
from 1 (all the time) to 7 (not at all). Reports were then grouped 
as either low (1-3 on the scale) or high (5-7 on the scale). 
Supervised machine learning methods were trained using 
acoustic-prosodic features to classify these instances, achieving an 
accuracy of 64%. However, it is unclear whether this detector 
could generalize to new students as the validation method did not 
ensure student-level independence across training and testing sets. 


Researchers have also built MW detectors based on information 
readily available in log files collected during the reading (e.g., 
reading time, complexity of the text). For example, [19] attempted 
to classify whether students were MW while reading a screen of 
text using reading behaviors and textual features (e.g., text 
difficulty). They were able to classify MW at 21% greater than 
chance using a leave-one-subject-out cross-validation method. 
Similarly, another study [11] also attempted to predict MW during 
reading using textual features such as word familiarity, difficulty, 
and reading time. However, rather than using supervised machine 
learning, they used a set of researcher-defined thresholds to 
ascertain if participants were “mindlessly reading” based on 
difficulty and reading time. 


More recent studies have explored additional techniques to detect 
MW during self-paced computerized reading [5, 8, 11]. In these 
studies, MW was measured via thought probes that occurred on 
pseudo-random screens (i.e., a screen of text similar to a page of 
text). Participants responded either “yes” or “no” based on 
whether they were MW at the time of the probe. Supervised 
classification models were trained to discriminate the two 
responses using physiological features (e.g., skin conductance, 
temperature) [8] or eye-gaze [5], achieving accuracies ranging 
from 18% to 23% above chance and validated in a manner that 
generalized to new students. Further, combining the two 
modalities led to an 11% improvement in detection accuracy 
above the best individual modality [4]. 


Beyond reading, Pham et al. [22] provide initial proof that MW 
detection is possible during lecture viewing. Students watched 
video lectures on a smart phone using a MOOC-like application 
and responded yes or no to thought probes during the lectures. 
They used student heart rate (extracted via 
photoplethysmography) to train classifiers to detect MW. They 
achieved a 22% greater than chance detection accuracy, thereby 
providing some initial evidence of MW detection in a MOOC-like 
learning environment. 


Hutt et al. [15] focused on detecting MW during learning with an 
intelligent tutoring system (ITS). Students’ eye gaze was tracked 
with a consumer grade eye tracker as they completed a 30-40 
minute learning session with the ITS. Students reported MW by 
responding to pseudo-random thought probes throughout the 
session. A variety of supervised classification models were trained 
to detect MW from eye movements and basic contextual 
information (e.g., time within session), achieving 
student-independent MW detection that was 37% greater than chance. 


Finally, Mills et al. [18] studied MW detection in the context of 
viewing a narrative film. This study used a research grade eye 
tracker to monitor eye movements from which content-free global 
gaze features (e.g., fixation duration) as well as content specific 


features were computed. The content specific features were 
generated from two areas of interest (AOIs): one from the saliency 
map of the image [14], and one specific to the film being watched. 
These AOIs were then used in conjunction with eye gaze to 
generate content specific (local) features (e.g., average distance of 
fixations from an AOI or intersections with the AOI). The key 
finding was that, unlike in reading tasks, models built using local 
features were more successful than those built from global gaze 
features, achieving a student-independent score of 29% above 
chance. 


1.2 Current Study and Novelty 


The novelty of this paper is two-fold. First, we build the first 
gaze-based detector of MW during video lecture viewing. We 
focus on eye tracking due to well-known relationships between 
visual attention and eye-movements. For example, MW has been 
associated with longer fixation durations [25] and more blinking 
in reading [33]. We use low-cost consumer-grade eye trackers to 
collect gaze data from participants as they view a recorded lecture 
(see Figure 1). Since research grade eye trackers can cost upwards 
of $40,000, the selection of affordable equipment (less than $150) 
increases the applicability of this work, enabling its eventual 
deployment in real world learning environments such as 
classrooms or students’ homes. 


Second, we compare MW detection with the more generalizable, 
global eye gaze features to AOI based local features. Global eye 
gaze features have previously been successful for detecting MW 
in learning contexts such as reading [7] and interacting with an 
ITS [15]; however, recent work involving narrative film 
comprehension found that AOI based features were more effective 
in that context [18]. We explore if the differences in visual style 
and production techniques between a recorded lecture (Figure 1) 
and a narrative film (Figure 2) influence the effectiveness of local 
features for detecting MW. This is a critical comparison because 
the global features are much more generalizable. 


Figure 2. Example frame from narrative film 


2. MW DETECTION 


2.1 Procedure 

Participants (or students) were 32 undergraduate students from a 
Canadian University, and they were compensated with course 
credit for their participation in the study. Participants watched a 
24 minute lecture on population growth and were informed that 
there would be a test over what they had learned after watching 
the video. MW was defined as “Any thoughts that are not related 
to the material being presented”, with examples such as 
“Concerns about an upcoming exam” and “Thoughts about 
dinner”. Students also had the opportunity to ask questions 
regarding the instructions before the video began, but throughout 
the process, students had no control over the video. 


Eye movements were monitored using a COTS eye-tracker called 
the EyeTribe that retails for $99. The eye tracker was placed just 
below the monitor on the desk. 


2.2 Thought Probes 


Mind wandering was measured during the recorded lecture using 
auditory thought probes, which is a standard approach in the 
literature [30]. Each student received 12 probes throughout the 
course of the recorded lecture that appeared at pre-determined 
times in the video. For each probe, the video paused and text was 
displayed on the screen asking, “In the moments prior to the probe 
were you MW?” Participants could then respond “1” for yes or 
“0” for no. Overall, students reported MW in 31% of the probes. 


It is important to emphasize a few points about the method used to 
track MW. First, this method relies on self-reports because MW is 
an inherently internal phenomenon which requires self-awareness 
for reporting [32]. Second, self-reports of MW have been 
objectively linked to patterns in pupillometry [12], eye-gaze [25], 
and task performance [23], providing validity for this approach. 
However, at this time, there are no reliable neurophysiological or 
behavioral markers that can accurately substitute for the self- 
report methodology [32]. Indeed, this is the very reason we set out 
to build gaze-based MW detectors. The limits of thought probes 
are considered further in the Discussion section. For now, we note 
that our use of thought-probes to measure MW is consistent with 
the state of the art in the psychological and neuroscience 
literatures [32]. 


2.3 Feature Engineering 

We calculated features from 30-second windows (window size 
was based on previous work [6, 15]) preceding each thought 
probe. We investigated two types of features: global gaze (from 
previous work [15]) as well as local features (based on [18]). 
Global gaze features focus on general gaze patterns and are 
independent of the content on the screen; whereas, local features 
encode where gaze is fixated on the screen. 
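
The windowing step can be sketched as below. The record layout (onset time in seconds, duration in milliseconds), the function name, and the sample values are assumptions made for illustration; the source does not describe how gaze events were stored.

```python
def window_before_probe(events, probe_time, window_s=30.0):
    """Return the events whose timestamp lies in the window_s seconds before the probe."""
    start = probe_time - window_s
    return [e for e in events if start <= e[0] < probe_time]


# Hypothetical fixation records: (onset time in seconds, duration in ms).
fixations = [(95.0, 210), (101.2, 340), (118.7, 180), (129.9, 260), (131.0, 400)]
in_window = window_before_probe(fixations, probe_time=130.0)
```

Only the fixations that start between second 100 and second 130 are kept; events after the probe are excluded so that features describe gaze preceding the self-report.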


2.3.1 Global Features 


Eye movements were measured by fixations (i.e., points in which 
gaze was maintained on the same location) and saccades (i.e., the 
movement of the eyes between fixations). We calculated fixations 
and saccades from the raw eye gaze data using the Open Gaze and 
Mouse Analyzer (OGAMA) [35]. We considered six general 
measures across the 30-second window (bolded in Table 1) from 
which we computed the number, mean, median, minimum, 
maximum, standard deviation, range, kurtosis, and skew of the 
distributions, yielding 54 features. We also included three other 
features (see Table 1), yielding a total of 57 global gaze features. 
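
To make the count of 54 distribution features concrete, the sketch below computes nine descriptives for each of six measures. It is an illustration only: the measure keys, the use of the population standard deviation, and the moment-based definitions of skew and excess kurtosis are assumptions, since the source does not state which estimators were used.

```python
import statistics


def nine_descriptives(values):
    """Number, mean, median, min, max, SD, range, kurtosis, and skew of one measure."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD (assumption)
    # Central moments; skew and kurtosis are set to 0.0 when there is no spread.
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / sd ** 3 if sd > 0 else 0.0
    kurtosis = m4 / sd ** 4 - 3.0 if sd > 0 else 0.0  # excess kurtosis (assumption)
    return {
        "number": n, "mean": mean, "median": statistics.median(values),
        "min": min(values), "max": max(values), "sd": sd,
        "range": max(values) - min(values),
        "kurtosis": kurtosis, "skew": skew,
    }


# Hypothetical keys for the six general measures.
MEASURES = ["fixation_duration", "saccade_duration", "saccade_length",
            "saccade_angle_absolute", "saccade_angle_relative", "saccade_velocity"]


def global_distribution_features(window):
    """window maps each measure name to its list of values in one 30-second window."""
    features = {}
    for measure in MEASURES:
        for stat, value in nine_descriptives(window[measure]).items():
            features[f"{measure}_{stat}"] = value
    return features


# Toy window that reuses the same values for every measure.
window = {m: [100.0, 200.0, 300.0, 400.0] for m in MEASURES}
features = global_distribution_features(window)
```

Six measures times nine descriptives gives the 54 features; the three remaining single-valued features would be appended to this dictionary to reach 57.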


Table 1. Eye-gaze features. Bolded cell indicates that nine 
descriptives (e.g., mean) were used as features (see text) 


Feature                        Description 
Fixation Duration              Elapsed time in ms of fixation 
Saccade Duration               Elapsed time in ms of saccade 
Saccade Length                 Distance of saccade in pixels 
Saccade Angle Absolute         Angle in degrees between the x-axis and the saccade 
Saccade Angle Relative         Angle of the saccade relative to previous gaze point 
Saccade Velocity               Saccade Length / Saccade Duration 
Fixation Dispersion            Root mean square of the distances of each fixation to the average fixation position 
Horizontal Saccade Proportion  Proportion of saccades with relative angles <= 30 degrees above or below the horizontal axis 
Fixation Saccade Ratio         Ratio of fixation duration to saccade duration 


2.3.2 Local Features 


Local features were computed based on the relationship between 
eye movements and an area of interest (AOI). Two AOIs were 
defined for each frame of the lecture video that fell within the 
window: the most visually salient region of the frame, and the face 
of the lecturer. Visual saliency was determined using a MATLAB 
implementation of the Graph-Based Visual Saliency Algorithm 
[14] which produced a saliency map of pixel intensity from 0 to 1 
for each frame that considered color, intensity, orientation, 
contrast, and movement. Determining the most visually salient 
region consisted of removing pixels with an intensity below a 
certain threshold (starting at 60% of the most intense pixel in the 
frame), leaving one or more regions of pixels as seen in Figure 3. 


Figure 3. Example most salient region, lighter areas indicate 
higher saliency. 


If the largest region had an area less than 2000 pixels (about 2% 
of the total area and a similar size to the face AOI), it was selected 
as the most visually salient region; otherwise, the process was 
repeated with a lower threshold. Figure 3 shows an example 
selection; in this case, the lecturer is gesturing, and the hand area 
was chosen as the most salient region. The face AOI was 
computed by detecting the facial location in the video using the 
commercially available software, Emotient [38]. The software 
provided the height and width of the face as well as the location, 
which was converted into a bounding box after adding a small 
buffer of 20 pixels to account for any tracker inaccuracies. 


There were 17 features calculated from each AOI for a total of 34 
features. The features can be divided into three types: (1) AOI 
distance, (2) AOI intersection, and (3) saccade landing. AOI 
distance features consisted of descriptive statistics (minimum, 
maximum, mean, median, standard deviation, skew, kurtosis, and 
range) of the distance between the center of the AOI and the 
fixation position for each frame where the AOI was present, for a 
total of eight AOI distance features per AOI. AOI intersection 
features captured the proportion of time that gaze was within the 
bounding box or within one or two degrees of visual angle from 
the bounding box, resulting in a total of three AOI intersection 
features per AOI. Saccade landing features consisted of counting 
the number of times saccades landed on an AOI, left an AOI, or 
occurred within an AOI. To account for tracking noise, an 
additional set of saccade landing features were computed that 
counted the same events if they occurred within one degree of 
visual angle from the AOI, for a total of six saccade landing 
features per AOI. 
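
The AOI distance and AOI intersection features can be sketched as follows. The box layout (left, top, width, height), the function names, the gaze samples, and in particular the pixels-per-degree constant are assumptions: converting visual angle to pixels depends on screen size and viewing distance, which the source does not report.

```python
import math


def distance_to_box(point, box):
    """Euclidean distance in pixels from a gaze point to a box (0 if inside)."""
    x, y = point
    left, top, width, height = box
    dx = max(left - x, 0.0, x - (left + width))
    dy = max(top - y, 0.0, y - (top + height))
    return math.hypot(dx, dy)


def intersection_proportion(gaze_points, boxes, margin_px=0.0):
    """Proportion of frames in which gaze lies within margin_px of the AOI box."""
    hits = sum(distance_to_box(p, b) <= margin_px for p, b in zip(gaze_points, boxes))
    return hits / len(gaze_points)


def center_distances(gaze_points, boxes):
    """Per-frame distance between the gaze point and the AOI center."""
    return [
        math.hypot(x - (left + width / 2), y - (top + height / 2))
        for (x, y), (left, top, width, height) in zip(gaze_points, boxes)
    ]


PX_PER_DEGREE = 40.0  # hypothetical conversion from visual angle to pixels
face_box = (100.0, 100.0, 50.0, 50.0)
gaze = [(120.0, 130.0), (160.0, 125.0), (400.0, 300.0), (125.0, 125.0)]
boxes = [face_box] * len(gaze)

inside = intersection_proportion(gaze, boxes)
within_one_degree = intersection_proportion(gaze, boxes, margin_px=PX_PER_DEGREE)
within_two_degrees = intersection_proportion(gaze, boxes, margin_px=2 * PX_PER_DEGREE)
distances = center_distances(gaze, boxes)
```

The three proportions correspond to the three intersection features per AOI, and descriptive statistics of `distances` would give the eight distance features.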


2.4 Model Building 


We focused on Bayesian Networks as they yielded the best 
performance compared to several other standard classifiers on this 
task in our previous work [15]. We used the default 
implementation from the Weka data mining package [13]. We 
validated the models with a leave-one-participant-out cross- 
validation scheme. For each fold, probe responses of one 
participant are held out for testing, and the model is trained on the 
remaining probes. This process ensures that no instances of any 
individual participant could appear in both the training and testing 
sets within a fold. This process is then repeated for the number of 
participants. 
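
The validation scheme can be sketched as a fold generator. This is a minimal sketch that assumes each probe instance is a dictionary with a `participant` key; it is not the Weka configuration used in the study.

```python
def leave_one_participant_out(instances):
    """Yield (held_out_id, train, test), holding out all probes of one participant per fold."""
    participants = sorted({inst["participant"] for inst in instances})
    for held_out in participants:
        train = [i for i in instances if i["participant"] != held_out]
        test = [i for i in instances if i["participant"] == held_out]
        yield held_out, train, test


# Hypothetical data: three participants with two probes each.
probes = [{"participant": p, "probe": n} for p in ("s01", "s02", "s03") for n in (1, 2)]
folds = list(leave_one_participant_out(probes))
```

Because the split is made on participant identifiers and not on individual probes, no participant contributes instances to both the training and the testing set of a fold.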


In total, there were 384 probes during the lecture. Of those, 12 
were discarded due to insufficient eye gaze data (< 1 fixation) in 
the respective window to compute all the global features. The 
remaining 372 instances were used across all feature sets to ensure 
a fair comparison. Students reported MW in 31% of the 372 
instances, thereby leading to data skew. This imbalance between 
labels poses a challenge as supervised learning methods tend to 
bias predictions towards the majority class label. To compensate 
for this concern, we use the SMOTE algorithm [9] to create 
synthetic instances of the minority class by interpolating feature 
values between an instance and its randomly chosen nearest 
neighbors until the classes were equated. SMOTE was only done 
on the training sets; testing sets were unaltered in order to ensure 
validity of the results. 
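
The interpolation idea behind SMOTE can be sketched as below. This is a simplified illustration and not the implementation used in the study: the function name, the random choice of base instances, the value of k, and the toy feature vectors are assumptions. It would be applied to the training portion of each fold only.

```python
import random


def smote_balance(majority, minority, k=5, seed=0):
    """Add synthetic minority instances until both classes have equal size.

    Requires at least two minority instances.
    """
    rng = random.Random(seed)
    synthetic = []
    while len(minority) + len(synthetic) < len(majority):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base instance (excluding itself).
        neighbors = sorted(
            (m for m in minority if m is not base),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position on the segment between base and neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return minority + synthetic


# Hypothetical two-feature training instances.
majority = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.4, 0.2], [0.1, 0.1]]
minority = [[0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
balanced = smote_balance(majority, minority)
```

Each synthetic instance lies on the line segment between two real minority instances, so the minority class is enlarged without duplicating existing points.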


2.5 Results 


The classification results are shown in Table 2. Because our 
intention is to detect instances of MW, we focus on the precision, 
recall, and F1 score of the MW class as our key metrics. For 
comparison, a chance-level baseline was created by randomly 
assigning the MW label to 31% (i.e., the MW base rate) of the 
instances over 1,000 iterations and averaging the result. 
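
A chance-level baseline of this kind can be sketched as follows. The label vector is hypothetical (115 MW reports out of 372 instances approximates the 31% base rate), and the function name and seed are assumptions made for the example.

```python
import random


def chance_baseline(labels, base_rate, iterations=1000, seed=0):
    """Average precision, recall, and F1 of the MW class under random labeling."""
    rng = random.Random(seed)
    n_positive = round(base_rate * len(labels))
    n_actual = sum(labels)
    totals = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        # Randomly mark a base_rate share of the instances as MW.
        predicted = rng.sample(range(len(labels)), n_positive)
        tp = sum(1 for i in predicted if labels[i] == 1)
        precision = tp / n_positive
        recall = tp / n_actual
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        totals = [t + v for t, v in zip(totals, (precision, recall, f1))]
    return tuple(t / iterations for t in totals)


# Hypothetical labels: 1 = MW reported, 0 = no MW reported.
labels = [1] * 115 + [0] * 257
precision, recall, f1 = chance_baseline(labels, base_rate=0.31)
```

When the random labels are assigned at the base rate, the expected precision and recall of the MW class both equal that base rate, which is why the chance row of a results table shows roughly the same value for all three metrics.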


The results indicated that, while all models outperform the chance 
baseline: (1) global features outperformed local features and (2) 
adding local features to the global features increased precision but 
decreased recall, leading to no improvement in F1 MW over 
global features alone. The fact that the best results were obtained 
from global features is significant because these features are more 
likely to generalize across interaction contexts. 


Table 2. MW detection results for the recorded lecture 


Feature Set      F1 MW   Precision MW   Recall MW 
Global           0.47    0.39           0.62 
Local            0.36    0.40           0.34 
Global + Local   0.42    0.45           0.39 
Chance           0.30    0.30           0.30 


3. GENERAL DISCUSSION 


MOOCs present an exciting new era for education, providing 
more resources for traditional and non-traditional students alike. 
However, little is known about user experience and student 
engagement [17] with MOOCs, and it is widely known that they 
are plagued with poor retention rates [37]. Attention is critical to 
learning [23], and monitoring attentional states of students is a 
step towards better understanding the learning process. MW is 
one key attentional state that is negatively correlated with learning 
[21]. MW is a covert, internal state with no obvious behavioral 
markers, making it difficult to detect. Although strides have been 
made to detect MW using eye gaze in the context of self-paced 
reading, gaze-based MW detection has not yet been attempted in 
the context of recorded lectures, a key component of many 
MOOCs. This is the challenge we address in the current paper. In 
the remainder of this section, we discuss our main findings, 
potential applications, limitations, and future work. 


3.1 Main Findings 


MW detection during reading is supported by decades of research 
on attention and eye movements [24]. Recent work has branched 
away from reading into more complex environments [15, 18] that 
do not afford predictable patterns of eye movements. We 
have shown that MW detection is possible in the context of 
viewing a recorded lecture. We were able to classify MW with 
an F1 of 0.47, which is a 56% improvement over chance. 
Although this result is modest, it is an important first step in 
detecting MW in this domain, especially using consumer-grade 
eye tracking equipment. 


Since MW detection in the context of online learning is still in its 
infancy, it is important that we explore techniques that are both 
successful and generalizable. We considered two feature sets in 
this work: global eye gaze features, which have previously 
performed well at detecting MW during reading and while 
interacting with an ITS, and local features, based on AOIs, that 
have previously been shown to be successful at predicting MW 
during narrative film viewing. In the context of lecture viewing, 
we have shown that global eye movements outperform local AOI- 
based features, contrasting previous work during narrative film 
viewing [18] that found the opposite pattern. 


It is interesting to consider why AOIs were less successful in this 
context as opposed to narrative film viewing. One suggestion lies 
in the different styles of the two media. Commercial, narrative 
films are directed with the viewer in mind, directing the 
audience’s attention to whatever is pertinent. In many cases, films 
are produced by professionals with years of experience and 
numerous qualifications in their art form. In contrast, a recorded 
lecture involves far more basic film production techniques, and in 
many cases the film audience is the secondary audience; the 
lecture itself is designed for the audience in the room. Our 
methods rely on automated AOI detection. It may be that these 
style differences affect that detection, having a downstream effect 
on the features generated from those AOIs. Further research 
would be required to confirm this hypothesis. 
All data was collected using low-cost, consumer-grade eye 
trackers (less than $150). This is a marked contrast compared to 
many research-grade trackers that can cost tens of thousands of 
dollars. Our hope is that these models can be deployed at scale 
and can be used to improve engagement and learning from 
MOOCs. For this reason, it was important to ensure that our 
models were validated in a student-independent manner which 
increases our models’ ability to generalize to new students. The 
combination of student-independent models and consumer grade 
eye tracking increases our confidence that the models will 
generalize more broadly to applications outside of the laboratory, 
though this claim requires further empirical validation. 


3.2 Applications 

Lecture videos play a major role in online learning with MOOCs, 
so our MW detectors can be quite beneficial in that context. Our 
detectors could be implemented to provide real time updates to 
the MOOC software regarding the students’ attention. Should a 
student be MW, the MOOC software could then adopt a variety of 
potential intervention strategies to refocus attention to the 
learning task. This could include simply pausing the video, 
asking a content-specific question, or asking the student to self- 
explain content that has recently been covered. Both interleaved 
questions [34] and self-explanations [20] have been shown to be 
effective in focusing attention. Students who answer incorrectly 
could then be encouraged to further review material and try again 
or could be redirected to an earlier point in the video. These 
approaches would give them multiple opportunities to correct the 
learning deficits attributed to MW. 


It is important to consider that such interventions rely on MW 
detection which is inherently imperfect. The detector may issue a 
false alarm, suggesting that a student is MW when (s)he is not, or 
it could miss that a student is MW. In our view, MW detection 
does not need to be perfect as long as there is a modicum of 
accuracy. Imperfect detection can be addressed with a 
probabilistic approach, where the detector outputs an MW 
likelihood that is then used to determine whether an intervention 
is triggered (i.e., if the likelihood of MW is 70%, then there is a 
70% chance of an intervention). The interventions should also be 
designed to “fail-soft” in that there are no harmful effects to 
learning if delivered incorrectly. 
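The probabilistic triggering rule can be sketched in a few lines. This is only an illustration of the idea that an intervention fires with probability equal to the detector's MW likelihood; the function name and its interface are hypothetical and not part of any MOOC platform.

```python
import random


def should_intervene(mw_likelihood, rng=random):
    """Return True with probability equal to the detector's MW likelihood.
    A likelihood of 0.7 therefore triggers an intervention about 70% of
    the time (hypothetical helper for illustration)."""
    if not 0.0 <= mw_likelihood <= 1.0:
        raise ValueError("mw_likelihood must lie in [0, 1]")
    return rng.random() < mw_likelihood
```

Compared with a hard threshold, this rule intervenes less often when the detector is uncertain, which limits the cost of false alarms.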


A further application is to inform the development of future 
MOOCs. Data from students’ attention patterns whilst interacting 
with a MOOC video can be used to improve course structure (e.g., 
number of lectures and lecture length) as well as course content 
(e.g., individual explanations). 


3.3 Limitations 

We designed our approach to include a low-cost eye tracker; 
however, consumer models have a lower sampling rate, limiting 
the accuracy of eye-gaze data compared to research-grade eye 
trackers. Furthermore, a key limitation was that we considered 
one lecture, so generalizability to other lectures is unknown. In 
addition, data was collected in a quiet lab environment; for better 
ecological validity we would need to explore more authentic 
learning environments (e.g. homes or libraries). 


A further limitation relates to the use of thought probes which 
require users to be mindful of their MW and respond honestly. 
Although this methodology has been previously validated [12, 23, 
25], there is no clear alternative to track a highly internal state like 
MW outside of measuring brain activity in an fMRI scanner. One 
futuristic possibility is to combine self-reports and wearable 
electroencephalography (EEG) as a means of collecting more 
accurate MW responses, but it is unclear if this can be done in 
more realistic contexts. 


3.4 Future Work 


The results discussed here invite several possibilities for 
improvement that we will address as future work. First, we will 
explore eye movements in different lectures. Having shown that 
global gaze models are applicable in this context, we will explore 
if we can train a model on one recorded lecture and use that model 
on other lectures and other topics. We will also explore 
cross-training to other educational environments, to gain a better 
understanding of the differences and similarities in eye 
movements and attention across learning situations. 


Another potential avenue is to integrate the detector into a MOOC 
to detect MW in real time. Here, the MW probes will be based 
upon the detector’s real-time assessment of students’ attention 
instead of prescribed or pseudo-random probing. We can then 
better evaluate our detectors by comparing the probabilistic 
assessment of MW to students’ responses to probes. Provided 
this refinement is successful, we could then use the detector to 
create a MOOC environment that intervenes in real time. 


4. CONCLUSION 


The popularity of MOOCs has ushered in an exciting time for 
students everywhere while also bringing challenges for educators. 
Advances in consumer grade eye tracking allow us to take a step 
towards a better understanding of how students engage with 
MOOCs on a larger scale. We have shown that we can detect MW 
in recorded lectures at above chance level. While much MW 
research has focused on the context of reading, our findings 
suggest that it might be possible to apply research on eye gaze, 
attention, and learning to this new context, thereby affording new 
discoveries about how students learn and interact with MOOCs 
while designing interfaces to sustain attention during learning. 
ABSTRACT 


The analysis of log data generated by online educational sys- 
tems is an important task for improving the systems, and 
furthering our knowledge of how students learn. This paper 
uses previously unseen log data from Edulab, the largest 
provider of digital learning for mathematics in Denmark, to 
analyse the sessions of its users, where 1.08 million student 
sessions are extracted from a subset of their data. We pro- 
pose to model students as a distribution of different underly- 
ing student behaviours, where the sequence of actions from 
each session belongs to an underlying student behaviour. 
We model student behaviour as Markov chains, such that 
a student is modelled as a distribution of Markov chains, 
which are estimated using a modified k-means clustering 
algorithm. The resulting Markov chains are readily inter- 
pretable, and in a qualitative analysis around 125,000 stu- 
dent sessions are identified as exhibiting unproductive stu- 
dent behaviour. Based on our results this student represen- 
tation is promising, especially for educational systems offer- 
ing many different learning usages, and offers an alternative 
to common approaches like modelling student behaviour as 
a single Markov chain, as is often done in the literature. 


Keywords 
Markov Chains, Sequence Modelling, Clustering 


1. INTRODUCTION AND RELATED WORK 


How students interact with educational systems is today an 
important topic. Knowledge of how students interact with a 
given system can give insight into how students learn, and 
directions for the further development of the system based on 
actual use. The interaction can be studied either through explicit 
studies [7] that directly observe student interaction in situ, or 
through log data collected automatically during use of the 
system, as done in this paper. 


Analysis of log data is often viewed as an unsupervised 
clustering problem at the student level [4, 8]. Our work 
takes another direction and focuses on the action sequence 
level. For clustering sequences, Markov models are popular 
as they provide a convenient way of modelling the transi- 
tions and dependencies of the sequences [9]. For action se- 
quence mining, both hidden and explicit models have been 
used depending on the tested hypothesis, and on whether 
the states are explicit or implicit. Beal et al. use hidden 
Markov models for student prediction, assuming underly- 
ing hidden states of engagement, which can be clustered [2]. 
Kock and Paramythis use explicit states for analysing prob- 
lem solving activity sequences, as the states in this case are 
explicit and therefore appear directly in the log [9]. 


The choice of clustering of the Markov models depends on 
the application area. Klingler et al. performed student 
modelling using explicit Markov chains, and the clustering 
was done with different similarity measures defined on 
the Markov chains themselves [8], e.g. Euclidean distance 
between transitional probabilities, or Jensen-Shannon 
divergence between the stationary probabilities of the chains. 
When individual sequences are clustered, an underlying as- 
sumption of the data coming from a mixture of Markov 
chains has been used [10], where the individual chains rep- 
resent the cluster centres, and the task is finding both the 
chains and the mixing coefficients. 


The work presented in this paper uses discrete Markov 
chain models for action sequence analysis on log data¹ 
acquired from the company Edulab. Edulab is the largest 
provider of digital learning for mathematics in Denmark, 
having 75% of all schools as customers, and receiving more 
than 1 million student answers a day. Using a mixture of 
Markov chains, we assume that each chain will represent a 
prototype student behaviour. So the underlying assumption 
in this work is that each student can be modelled as behav- 
ing according to some underlying behaviour during each ses- 
sion, and a student can then be seen as a distribution over 
different behaviours. Edulab’s product offers many different 
ways of learning mathematics, ranging from question-heavy 
workloads to video and text lessons, and other activities 
depending on whether the student is in class or at home. This 
allows us to model a student as "distributed" over different 
behaviours, in contrast to a single student behaviour model of 
how the student usually interacts with the system. 


We reason that a mixture of Markov chains will allow for a 
qualitative study of what type of behaviour each chain 
represents, and thus ultimately it can be used to show how a 
student uses the educational system. 


¹The data is proprietary and not publicly available. 


Mixtures of Markov models can be estimated by the EM 
algorithm, which however is notoriously slow to run for large 
amounts of data, and only locally optimal solutions are found 
[6]. In this paper we need fast processing in order to analyse 
the large amounts of data produced by Edulab, so we 
simplify the assumptions on the underlying Markov chains, 
which allows for a modified version of k-means clustering. 


Initial cluster centres, representing underlying student be- 
haviour, can be chosen by domain experts and then refined 
through the clustering. However, since the true number of 
underlying clusters is unknown, it is difficult for an expert 
to predefine sensible cluster centres for a range of different 
numbers of clusters. In this work we first perform simula- 
tions to consider the effect of starting at the correct locations 
versus adding noise to the correct location until the start- 
ing points are completely random. Based on these results 
clustering is done on the Edulab dataset, and a qualita- 
tive analysis is performed on the resulting Markov chains. 
This shows how students are distributed among the Markov 
chains, and how unproductive system usage can be detected 
using the Markov chains. 


In summary, the primary research questions this paper 
addresses are: 1) to what extent can students be modelled 
as a distribution over underlying usage behaviours which 
changes across sessions, and 2) how can this modelling give 
producers of educational systems insight into future 
improvements of the system. 


2. DATA 


The data used in this work is produced by matematikfessor.dk, 
a Danish mathematics portal made by Edulab that 
spans the curriculum for students aged 6 to 16. The web- 
site offers both video and text lessons in combination with 
exercises covering the whole curriculum, such that it can be 
used as a primary tool for learning, and not only supplemen- 
tary. Log data generated by the grade levels corresponding 
to students of age 12 to 14 for the 2016 school year is used 
(from August 2016 to February 2017). An action in this 
system can either be watching a lesson, which contains ei- 
ther a video or text description, or answering a question. 
Lessons and questions both have a topic id, specifying the 
general topic of the question or lesson. The data statistics 
are summarized in Table 1. The lessons and questions can 
be assigned as homework or done freely by the students (this 
study does not differentiate between whether it is homework 
or not). It should be noted that completing a lesson takes 
significantly longer than answering a question, hence the lower 
ratio of lessons, compared to other actions, in Table 1. 


The logs do not contain information about when a session 
is started or finished, so we define a session as a sequence of 
actions, where the time between two actions is less than 15 
minutes. A student has on average 12.5 sessions (standard 
deviation of 13.3), and the histogram of the number of ac- 
tions in each action sequence can be seen in Figure 1, where 
sequence lengths larger than 200 have been removed from 
the plot for the purpose of visualization. When a student 
interacts with the system their actions are stored and seen as 




Figure 1: The distribution of action sequence 
lengths with lengths larger than 200 removed 
(x-axis: sequence length; y-axis: counts). 
an action sequence, an example of one is: 


Qr_1^{t}, Qw_2^{t'}, L_3^{t}, Qw_4^{t'}, Qr_5^{t''}, Qr_6^{t''}, Qr_7^{t''}   (1) 


Qr is a correctly answered question, Qw is an incorrectly 
answered question, and L is a lesson. The subscript denotes 
the action number in a temporal ordering, and the super- 
script denotes the topic id, which is associated with each 
lesson and question. 
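The session definition used above (consecutive actions less than 15 minutes apart belong to the same session) can be sketched as follows. The log format, a list of (timestamp, action) pairs per student, is an assumption made for illustration; the actual Edulab log schema is not described in the text.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=15)


def split_into_sessions(events, gap=SESSION_GAP):
    """Split one student's (timestamp, action) events into sessions.
    Two consecutive actions stay in the same session only if the time
    between them is strictly less than `gap`."""
    sessions = []
    for ts, action in sorted(events, key=lambda e: e[0]):
        if sessions and ts - sessions[-1][-1][0] < gap:
            sessions[-1].append((ts, action))
        else:
            sessions.append([(ts, action)])
    return sessions
```

For example, actions at 10:00, 10:05, 10:25 and 10:30 form two sessions of two actions each, since the 20-minute pause between the second and third action exceeds the gap.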


3. METHOD 


Our method for action sequence clustering will be explained 
in this section, and is based on modelling interactions with 
the system as Markov chains. Our Markov chain model with 
its transitions is shown in Figure 2. Our model consists of 
8 states as will now be explained with their abbreviations 
in parentheses. These abbreviations are used for visualizing 
the resulting Markov chains from the clustering. The first 
two are start (S) and end (E). The rest consists of three gen- 
eral states: Doing a lesson (L), answering a question right 
(Qr), or answering a question wrong (Qw). Each lesson and 
question has an associated topic id, which might change 
from action to action creating the last three states: doing 
a lesson in another topic than the previous action (L_c), 
answering a question right in another topic (Qr_c), and an- 
swering a question wrong in another topic (Qw_c). If we 
consider the sequence described in Equation 1, then that 
would correspond to visiting the following states 


S → Qr → Qw_c → L_c → Qw_c → Qr_c → Qr → Qr → E   (2) 
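The mapping from an action sequence to a path through the eight states can be sketched as below. The input format, a list of (kind, topic) pairs, is hypothetical, and the sketch assumes that the first action of a session is never counted as a topic change.

```python
def to_states(actions):
    """Map a session's actions to Markov chain states.
    `actions` is a list of (kind, topic) pairs with kind in
    {"L", "Qr", "Qw"}; a "_c" suffix marks a topic change relative to
    the previous action."""
    states = ["S"]
    previous_topic = None
    for kind, topic in actions:
        changed = previous_topic is not None and topic != previous_topic
        states.append(kind + "_c" if changed else kind)
        previous_topic = topic
    states.append("E")
    return states
```

Applied to a seven-action session whose topic changes at actions 2 to 5 and then stays fixed, this yields the path S, Qr, Qw_c, L_c, Qw_c, Qr_c, Qr, Qr, E.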


The pipeline for clustering has the following procedure. 


1. For every session we extract a sequence of actions a_1, ..., a_n, 
and each action sequence corresponds to a path in the 
Markov chain model used. 


2. Since the Markov chains are unknown, priors P_1, ..., P_k 
(which themselves are Markov chains) are generated at 
random such that each edge shown in Figure 2 has a 
transition probability drawn uniformly at random between 
0 and 1. Each random chain is normalized such that 
each state’s outgoing transitional probabilities sum to 
one. These priors are the counterpart of the usual 
initial cluster centers, which most often are random 
data points. Generating a Markov chain from a randomly 
chosen data point would however not work in our 
case, since many zero-valued transition probabilities 
would occur. 


3. Each action sequence is assigned to the prior which 
was most likely to generate it, i.e. 


\arg\max_{1 \le j \le k} \prod_{i=1}^{m} P_{j, b_{i-1}, b_i}   (3) 


where P_{j, b_{i-1}, b_i} is the transition probability from state 
b_{i-1} to b_i in prior P_j, m is the number of transitions 
between states, and k is the number of priors. 


4. After each action sequence has been associated with 
a prior, then each prior is updated by generating the 
Markov chain most probable given its associated ac- 
tion sequences. This is done by counting the state 
transitions in each sequence in a new Markov chain 
model, and normalizing afterwards. 


5. Points 3 and 4 are ideally reiterated until convergence, 
i.e. until no action sequence changes its associated prior. 
However, for computational reasons we stop iterating 
once fewer than 5% of the sequences change their 
assigned prior in an iteration. 
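The five steps above can be sketched in plain Python. This is a minimal illustration, not the authors' code. It assumes that every state except E may transition to every state except S (the exact edge set is given by Figure 2), and that a cluster left without sequences is re-seeded with a fresh random chain. All function names are hypothetical.

```python
import math
import random

STATES = ["S", "L", "Qr", "Qw", "L_c", "Qr_c", "Qw_c", "E"]


def random_chain(rng):
    """Step 2: a random prior with uniform(0, 1) edge weights,
    normalised so each state's outgoing probabilities sum to one."""
    chain = {}
    for a in STATES[:-1]:                                # E has no outgoing edges
        weights = {b: rng.random() for b in STATES[1:]}  # nothing returns to S
        total = sum(weights.values())
        chain[a] = {b: w / total for b, w in weights.items()}
    return chain


def log_likelihood(seq, chain):
    """Log of the product in Equation 3 for one state sequence."""
    total = 0.0
    for a, b in zip(seq, seq[1:]):
        p = chain.get(a, {}).get(b, 0.0)
        if p == 0.0:
            return -math.inf
        total += math.log(p)
    return total


def fit_chain(seqs, rng):
    """Step 4: count state transitions and normalise each row."""
    if not seqs:
        return random_chain(rng)  # assumption: re-seed an empty cluster
    counts = {}
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            row = counts.setdefault(a, {})
            row[b] = row.get(b, 0) + 1
    return {a: {b: c / sum(row.values()) for b, c in row.items()}
            for a, row in counts.items()}


def cluster(seqs, k, tol=0.05, seed=0, max_iter=100):
    """Steps 3 to 5: alternate assignment and update until fewer than
    `tol` of the sequences change their assigned prior."""
    rng = random.Random(seed)
    chains = [random_chain(rng) for _ in range(k)]
    labels = [None] * len(seqs)
    for _ in range(max_iter):
        new = [max(range(k), key=lambda j: log_likelihood(s, chains[j]))
               for s in seqs]
        changed = sum(1 for old, cur in zip(labels, new) if old != cur)
        labels = new
        chains = [fit_chain([s for s, l in zip(seqs, labels) if l == j], rng)
                  for j in range(k)]
        if changed < tol * len(seqs):
            break
    return labels, chains
```

Working with log probabilities avoids numerical underflow for long sequences, and it leaves the arg max of Equation 3 unchanged because the logarithm is monotonic.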


The clustering technique is very similar to ordinary k-means 
clustering, with the major difference that the clustering does 
not depend on a similarity measure defined directly on the 
sequences, but on the Markov chains generated by 
the clustering. Compared to ordinary k-means clustering, 
the chains produced in each iteration are analogous to the 
ordinary cluster centers found by taking a mean. The mixture 
model could also be estimated by the EM algorithm [1], 
which has the benefit that sequences that do not belong to 
a single clear cluster, i.e. that have multiple highly probable 
chains, will weigh in on all of them. This has the 
downside that clusters take longer to be separated, and the 
convergence is therefore slower. Under the assumption of 
the chains being distinct, each sequence will place most of its 
weight on a single chain, and here the k-means clustering method 
and EM algorithm will perform very similarly. For the data 
from Edulab we assume most of the chains to be distinct, 
Figure 2: Markov chain representing the 
possible states and transitions. Note that the 
transition probabilities in the two directions 
between a pair of states do not have to be equal. 


but not necessarily all. In addition a very large number of 
sequences will have to be clustered in the future when the 
full dataset is used, and not restricted as done for this paper. 
We are therefore mostly interested in how well the k-means 
clustering approach performs as it is more computationally 
feasible when the data size is increased. 


The above procedure leaves two challenges: 1) How do we 
know the resulting Markov chains are close to the real ones? 
and 2) How to estimate the number of priors? We address 
these points next. 


The first point is dealt with using synthetic data, where k 
random Markov chains are made, and each action sequence 
is generated from one of those chosen uniformly at random. 
In order to ensure a suitable length of the generated action 
sequences, the ingoing probabilities to the end state are fixed 
to allow for an average sequence length of 20. After gener- 
ating the synthetic data, the most probable Markov chain 
for each sequence is assigned as its label, and the goal in the 
clustering is to be able to capture these clusters. Note that 
since each sequence is randomly generated using the chosen 
Markov chain, its most probable Markov chain might 
not be the one generating it. To determine the ability to 
capture the original clusters we consider the average purity 
of the resulting clusters: 


Average purity = \frac{1}{n} \sum_{i=1}^{n} \max_{1 \le j \le k} \frac{|C_j \cap S_i|}{|S_i|}   (4) 


where S_i is an estimated cluster, C_j is a true cluster, n is 
the number of estimated clusters, and k is the number of true clusters. 
An average purity of 1 represents that the method fully cap- 
tures the original clusters. The underlying Markov chains 
are unknown on real data, so increasingly noisy versions of 
the underlying Markov chains are experimented with as pri- 
ors, to show how the method is expected to perform under 
real circumstances. 
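The purity measure can be sketched as follows. The sketch assumes that average purity is the mean, over estimated clusters, of the fraction of a cluster's members that share its most common true label. The function name and the label-list interface are hypothetical.

```python
from collections import Counter


def average_purity(estimated, true):
    """Mean over estimated clusters of the share of members that carry
    the cluster's most common true label. `estimated` and `true` are
    equally long lists of cluster labels, one entry per sequence."""
    members = {}
    for est_label, true_label in zip(estimated, true):
        members.setdefault(est_label, []).append(true_label)
    purities = [Counter(labels).most_common(1)[0][1] / len(labels)
                for labels in members.values()]
    return sum(purities) / len(purities)
```

As a worked example, with estimated labels [0, 0, 0, 1, 1, 1] and true labels [0, 0, 1, 1, 1, 1], the first cluster has purity 2/3 and the second has purity 1, giving an average purity of 5/6.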


In the case of real data, the true underlying Markov chains 
are unknown, so in this case the sum of the log likelihoods 
is calculated for the sequences to their most probable prior: 


sum of log likelihood = \sum_{i=1}^{N} \log(L(s_i | P_i^*))   (5) 


where s_i is an action sequence, N is the number of action 
sequences, P_i^* is the prior most likely to generate action 
sequence s_i, and L(s_i | P_i^*) is the likelihood 
that P_i^* generates s_i. 
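The sum of log likelihoods can be computed as sketched below. Chains are represented here as nested dictionaries mapping a state to its outgoing transition probabilities; this representation and the function names are assumptions made for illustration.

```python
import math


def sequence_log_likelihood(states, chain):
    """Log likelihood of one state sequence under one Markov chain,
    where chain[a][b] is the transition probability from a to b."""
    total = 0.0
    for a, b in zip(states, states[1:]):
        p = chain.get(a, {}).get(b, 0.0)
        if p == 0.0:
            return -math.inf  # the chain cannot generate this sequence
        total += math.log(p)
    return total


def sum_of_log_likelihoods(sequences, chains):
    """For each sequence take its most probable chain, then sum the
    resulting log likelihoods over all sequences."""
    return sum(max(sequence_log_likelihood(s, c) for c in chains)
               for s in sequences)
```

Running the clustering several times with different random priors and keeping the run with the largest value of this sum gives the model selection procedure described in the text.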


The second point mentioned earlier, about estimating the 
number of priors, can be solved using either the average pu- 
rity in the synthetic case, or from the sum of log likelihoods 
in the real case. The sum of log likelihoods as a function of k 
will be monotonically increasing, but the slope will decrease 
as k exceeds its true underlying value. Since the method 
starts with randomly chosen priors, it is repeated a number 
of times, and the solution with the largest log likelihood is 
chosen for each value of k. 


4. SIMULATED EXPERIMENT WITH 
NOISY PRIORS 


There are two approaches for estimating the Markov chains 
for the Edulab data set. 1) The prior Markov chains can 
be chosen by domain experts, by specifying common 
sequences we would expect to find in the data, and then 
refined during the clustering. 2) The second approach is as 
described in the method section: starting with random chains, 
running k-means multiple times, and taking the clustering 
which gives the highest sum of log likelihoods. To 
measure how the method behaves as the initial priors become 
increasingly noisy versions of the underlying Markov chains, 
k-means is run with the priors chosen as: 


P_i = (1 - a) P_i^* + a P_{rand}   (6) 


where all P are Markov chains represented by matrices of 
transitional probabilities, and a is the noise parameter. P_i 
is the i-th prior, P_i^* is the i-th underlying Markov chain used 
when generating the synthetic data, and P_{rand} is a random 
Markov chain. The higher a, the more noisy the initial prior 
is. 
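The noisy prior of Equation 6 is an element-wise blend of two transition matrices, which can be sketched as follows. Matrices are represented as lists of rows, and the function name is hypothetical.

```python
def noisy_prior(true_chain, random_chain, a):
    """Blend the underlying chain with a random chain:
    (1 - a) * true_chain + a * random_chain, element by element.
    Both arguments are row-stochastic matrices given as lists of rows."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("noise parameter a must lie in [0, 1]")
    return [[(1 - a) * p + a * r for p, r in zip(row_true, row_rand)]
            for row_true, row_rand in zip(true_chain, random_chain)]
```

Because each row of both inputs sums to one, each row of the blend also sums to one, so the result is again a valid Markov chain without renormalisation. Setting a = 0 returns the underlying chain and a = 1 returns the random chain.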


In Figure 3, we see how the average purity behaves as a 
function of noise parameter a. The experiment is run for 
k = 6, and 6 random chains are generated. The transition 
probabilities to the end state are fixed at 0.05 for all states 
for all chains to allow for sequences of average length 20. 
50000 sequences are sampled uniformly from the 6 chains. 
The modified k-means is then run with the priors varying 
depending on a, and the experiments are run 10 times and 
purity is the average over the 10 runs. First we note that 
even with using the modified k-means algorithm and not 
the EM algorithm the resulting average purities are quite 
high. It is seen that even with a = 1 representing com- 
pletely random priors, the reduction in purity is not too 
large compared to starting with the same priors as the data 
is generated from. Even starting with the same priors which 
generated the data does not guarantee perfect purity, which 
is expected as there are some sequences that are almost as 
likely under multiple chains, so small differences in the data 
determined Markov chains will move them from one chain 
to another. Based on the above result we will not define 
Figure 3: Average purity as a function of 
increasingly noisy priors. A completely 
random prior (1.0 on the x axis) is able to 
perform well. 


the priors by an expert, and instead let them be random. 
This has the benefit of being more manageable than hand- 
crafting specific priors for each choice of k, which would be 
very difficult to do in a meaningful way when k is large. 
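The synthetic data generation used in this experiment can be sketched as follows. The sketch assumes that the end transition, with fixed probability 0.05, is only available after at least one action has been taken, so that the number of actions follows a geometric distribution with mean 1 / 0.05 = 20. The remaining transition weights are drawn at random, and all names are hypothetical.

```python
import random

ACTION_STATES = ["L", "Qr", "Qw", "L_c", "Qr_c", "Qw_c"]


def random_action_chain(rng):
    """Random transition weights between action states (rows normalised).
    The end transition is handled separately in sample_sequence."""
    chain = {}
    for state in ["S"] + ACTION_STATES:
        weights = [rng.random() for _ in ACTION_STATES]
        total = sum(weights)
        chain[state] = [w / total for w in weights]
    return chain


def sample_sequence(chain, rng, end_probability=0.05):
    """Sample one state sequence S, ..., E from the chain. After every
    action the session ends with probability `end_probability`."""
    seq = ["S"]
    state = "S"
    while True:
        if state != "S" and rng.random() < end_probability:
            seq.append("E")
            return seq
        state = rng.choices(ACTION_STATES, weights=chain[state])[0]
        seq.append(state)
```

Sampling each sequence from one of k such chains chosen uniformly at random reproduces the setup of the simulated experiment under these assumptions.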


5. REAL DATA EXPERIMENT 


5.1 Choosing the number of clusters 

The problem of determining the number of clusters is common 
to all unsupervised learning tasks. In this paper we 
consider the sum of the log likelihoods for the action 
sequences. A common approach is the use of the "elbow" 
heuristic, where k is chosen based on the slope 
of the sum of log likelihoods as a function of k. 


In order to argue that there is structure in the data, and that 
the method is able to capture this structure, a randomized 
experiment is made. The randomized experiment consists 
of randomly permuting each sequence (but keeping the start 
and end states), and seeing how the sum of log likelihoods 
is affected by it. If there is no structure originally in the 
sequences, then one cannot expect the real data to yield a 
higher sum of log likelihoods than the permuted data. 
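The permutation used in the randomized experiment can be sketched as follows. The function name is hypothetical; the sketch shuffles only the interior of a state sequence and keeps the start and end states in place.

```python
import random


def permute_sequence(states, rng):
    """Return a copy of the state sequence with its interior shuffled
    while the first (start) and last (end) states stay where they are."""
    interior = list(states[1:-1])
    rng.shuffle(interior)
    return [states[0]] + interior + [states[-1]]
```

The permuted sequence contains exactly the same actions as the original, so any drop in the sum of log likelihoods can be attributed to the loss of ordering rather than to a change in which actions occur.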


In Figure 4 we see that the sums of log likelihoods are 
considerably lower in the permuted data set: the permuted data 
at k = 20 reaches only a slightly higher sum of log likelihoods 
than the real data at k = 2. The action sequences therefore 
have structure which the Markov chains capture, and it is 
therefore not just random chains that the k-means clustering 
produces. Since the chains capture some inherent structure 
in the data, it is meaningful to analyse the individual chains 
with regard to what user behaviour they capture. 


There is not an obvious breaking point in the sum of log 
likelihoods, but the increase before k = 6 is large, while the 
increase for k > 10 is notably smaller, so a value of k between 
6 and 10 is sensible. We will in the qualitative assessment of the 
chains use k = 6. 
Figure 4: 


5.2 Qualitative assessment of Markov chains 
This section makes qualitative assessments of what the 
different resulting Markov chains represent with regard to 
what type of user behaviour they capture. Even with six 
chains there is some similarity between some chains, so in 
this section we will focus on the three most distinct chains 
shown in Figure 5. The thickness of the arrows is propor- 
tional to the transitional probability for each state, except 
the ending state. The transitional probabilities are sorted 
and only drawn until 70% of the probability mass is cov- 
ered. For the ending state, 70% of the incoming transitional 
probabilities are drawn. 
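The rule for choosing which arrows to draw can be sketched as follows. The sketch assumes that the transition that crosses the 70% threshold is itself drawn; the function name and the dictionary representation of a state's outgoing probabilities are hypothetical.

```python
def edges_to_draw(row, mass=0.70):
    """Select outgoing transitions in descending order of probability
    until at least `mass` of the probability mass is covered.
    `row` maps a next state to its transition probability."""
    drawn = []
    covered = 0.0
    for state, p in sorted(row.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= mass:
            break
        drawn.append((state, p))
        covered += p
    return drawn
```

For a state with outgoing probabilities 0.5, 0.3, 0.15 and 0.05, only the first two transitions are drawn, since together they already cover 80% of the mass.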


In general not all chains can be described as either being a 
positive or negative usage of the system. Chain 2 captures 
usage where the questions answered in a session are mostly 
either all right or all wrong, and there is very little mixing between 
taking lessons and answering questions. Usage like this 
could indicate an unproductive session for students, since 
they are mostly getting all questions right or all questions 
wrong, and research shows that students feel more intrin- 
sic pleasure when the difficulty level is slightly challenging 
[5] leading to more engaged sessions [3]. Similarly, watch- 
ing lessons without engaging with the material via questions 
leads to students not training the learned material, which is 
important for the learning process. 


Chain 6 can be described as a positive usage of the sys- 
tem, as the most probable transitions lead to a question 
being correctly answered, except for the two transitions in 
the lessons. Generally students are focused on one topic at 
a time. 


Chain 4 has high transitional probability when switching be- 
tween topics, so this could indicate a session with a primary 
focus on repetition as the topic is varying, and students most 
often answer questions from another topic than the watched 
lessons. 
Figure 5: Chains 2, 6, and 4 of the six chains. 
The thickness of the arrows is proportional to 
the transitional probability for each state, 
except the ending state. The transitional 
probabilities are sorted and only drawn until 70% 
of the probability mass is covered. For the 
ending state 70% of the incoming transitional 
probabilities are drawn. State abbreviations 
are explained in section 3. 


          | Num. sequences | Avg. sequence length 
chain 1   | 295,792        | 34.81 
          |                | 36.58 
chain     | 95,1736        | 0.19 
chain 4   | 131,460        | 28.79 


Table 2: The number of sequences and average 
length of sequences for each Markov chain 

The distribution of the sessions over the chains can be seen 
in Table 2. 


The length of the sequences is varying, but no single chain 
in general captures either the very short or very long se- 
quences. Instead a combination of shorter and longer se- 
quences is captured by each chain. The most common chain 
can be seen in Fig 6. This chain is similar to chain 4 (Fig 
5), but with more topic changes and more wrongly answered 
questions when changing topics, which can be seen in the self 
loop for Qw_c. Chain 4 is also shorter on average. As seen 
in Table 2, all six chains contain a large number 
of sequences. This indicates that the system usage 
does indeed vary, and is not limited to all sequences of 
the same length defining the same use of the system. If one 
considers each user’s distribution of Markov chains, then on 
average each user has 3.5 different types of sessions out of 
6 with a standard deviation of 1.5. This supports the as- 
sumption that a single Markov chain is not optimal for user 
profiling for educational systems similar to the one generating
our data, where there is a lot of user freedom in what
activities they engage in.


Figure 6: Chain 1, the most common chain.
State abbreviations are explained in section 3.


6. DISCUSSION AND CONCLUSION 


In this work first order Markov chains have been used, but 
it is generally known that the action sequences do not ful- 
fil the Markov property of transition to a state only being 
dependent on the previous state. No order of Markov chain 
will completely capture the underlying transition between 
states, as the usage is dependent on many external factors 
which are unknown, but higher order chains would be able to 
capture more complex dynamics in the usage. Even though 
the Markov property is violated, Markov chains are still very 
widely used in educational data mining [4, 8], and provide 
a good tool for comparisons of action sequences across dif- 
ferent lengths, focusing on the flow of actions taken. In 
future work an interesting extension would be considering 
time dependent Markov models, such that the transitional 
probabilities are dependent on how long the states have been 
unchanging. This would allow for more interpretative mod- 
els, e.g. we could see when the probability of a session ending 
gets high. 
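
As a minimal sketch of the first order model discussed above, the following Python code estimates transition probabilities by counting consecutive state pairs in each session. The state names and the `estimate_transitions` helper are illustrative assumptions, not the state set or the code used in this work.

```python
from collections import Counter, defaultdict

def estimate_transitions(sequences):
    """Estimate first-order transition probabilities by counting state pairs."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Each consecutive pair (previous state, next state) is one observation
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        probs[prev] = {nxt: c / total for nxt, c in nxt_counts.items()}
    return probs

# Hypothetical sessions; the state names are placeholders
sessions = [
    ["lesson", "question_correct", "question_correct", "end"],
    ["lesson", "question_wrong", "question_correct", "end"],
]
transitions = estimate_transitions(sessions)
```

A higher order chain could be sketched the same way by using tuples of the last k states as the `prev` key, at the cost of many more parameters to estimate.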


When inspecting the Markov chains produced by the cluster- 
ing, chain number 2 indicated suboptimal or unproductive 
usage of the system, where the students either experience 
questions that are too easy or too hard, or never train what 
they learn in the lessons. The chain has 126,683 sessions
in its cluster, so there is a substantial number of
sessions where the learning outcome most likely could be
improved. Based on this it could be recommended to have
a few obligatory questions after a lesson to strongly encour- 
age the student to use what they have just learned, and 
detect negative spirals where the students are always wrong 
by recommending lessons to help the student move forward. 


Modelling the student as a distribution over Markov chains, 
which can be considered usage patterns, results in a vector 
representation of the individual students. This representation
allows standard techniques to be applied directly to the
student model, rather than working with more complex student
models. An example is the issue of drift in student behaviour
over time, corresponding to some learning, or wider
cognitive development of the student. This problem has
also been considered in a similar context in [8], where dis- 
tances between single Markov chains on a student level were 
estimated. However, in our setting standard methods could 
readily be used to detect this type of drift and potentially 
alert the teacher. 
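
To illustrate the vector representation, the sketch below turns the chain labels of a student's sessions into a distribution over the six chains and compares two time windows. The total variation distance and the example labels are assumptions chosen for illustration; the text above does not prescribe a particular drift measure.

```python
from collections import Counter

def chain_distribution(session_chains, n_chains=6):
    """Return the share of a student's sessions assigned to each chain (1..n_chains)."""
    if not session_chains:
        raise ValueError("at least one session is required")
    counts = Counter(session_chains)
    total = len(session_chains)
    return [counts.get(k, 0) / total for k in range(1, n_chains + 1)]

def total_variation(p, q):
    """Total variation distance between two distributions of equal length."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical chain labels for one student's early and late sessions
early = chain_distribution([1, 1, 4, 2])
late = chain_distribution([2, 2, 2, 6])
drift = total_variation(early, late)
```

A teacher alert could then be triggered when `drift` exceeds a threshold, which would have to be chosen empirically.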


The work presented shows a qualitative study of the proposed
student representation, and experiments using synthetic
data show that our methodology is able to capture
the underlying generative Markov chains very well, when 
the number of chains has been estimated. A source for fu- 
ture work will be using the student vectors in a predictive 
task, such that quantitative measures can be acquired. An 
interesting path would be using knowledge tracing methods 
over the different session types, to see if there are any un- 
expected differences between the knowledge acquired by the 
student depending on the type of session - i.e. the kind of 
Markov chain the session originates from.


ABSTRACT


Question and answer forums are becoming more popular as 
increasing numbers of lifelong learners rely on such forums to 
receive help about their learning needs. Stack Overflow (SO) is an 
example of such a forum used by millions of programmers. The 
ability of users to receive timely answers to questions is crucial to 
the sustainability of such forums and for successful lifelong 
learning. In SO we have observed that the number of questions
answered within 15 minutes has diminished, with more questions
taking longer to get answered or, in some cases, remaining
unanswered. This suggests the need for an effective approach in
predicting prospective helpers who can provide timely answers to 
the questions. In this paper, we seek to explore strategies to match 
helpers and help seekers. In particular we wish to use these 
strategies to predict which SO users will provide timely answers 
to questions asked in SO, and then compare these predictions to 
the users who actually answered the questions. In making these 
predictions we looked at 3 time frames of user data: 1 month, 3 
months and 6 months. We used 5 basic strategies: frequency, 
knowledgeability, eagerness, willingness, recency; and we 
compared the success rates of each strategy in making predictions 
on 3 different success criteria: predicting the first answerer, 
predicting the answerer most liked by the asker of the question, 
and predicting the answerer rated most highly by other SO users. 
We then incorporated a timeliness measure, which takes into 
consideration how quickly the user provides answers to questions 
in the past, which helped us to achieve a higher success rate. The 
results of our study are an improvement over a similar previous 
study of SO and we hope will form the basis of methods for 
recommending peers in online forums who can provide just-in- 
time help to lifelong learners as their knowledge needs evolve and 
change. 


Keywords 


peer help, lifelong learning, peer matching 


1. INTRODUCTION 


Professional lifelong learners depend on online learning forums to 
help to meet their learning needs [2]. Our research is focused on 
supporting lifelong learners as they interact in such open-ended 
learning environments. Stack Overflow (SO) is an example of an 
online question and answer (Q&A) forum which supports millions 
of programmers. Over time, the answer response times to 
questions have increased and the number of unanswered questions 
has also increased. According to Asaduzzaman et al. [1], failure
of the questions asked to attract expert users is the top reason for 
unanswered questions, accounting for about 21.75% of 
unanswered questions. Receiving prompt answers to questions is 
important to the sustainability of a Q&A forum [2] and for 
successful lifelong learning. 


While research efforts have been employed in the past in
predicting potential peer helpers within a classroom-learning
environment which encompasses just hundreds of students [4,
8, 10], a new challenge arises in an online learning environment
that is open ended with thousands or millions of potential helpers
with varied expertise and learning interests. Such an environment
needs a recommendation technique that scales up to millions
of available users¹ and also aligns with the knowledge, interests
and competency of the helper. Greer et al. [4]
in their study (similar to other studies [3, 8, 10]) employed the
availability, helpfulness, technical ability and social ability of the
helper as strategies considered in selecting the appropriate peer
helper from the available users.


In a previous study using SO users as surrogates for lifelong 
learners, we employed a tag-based Naive Bayes model to predict 
the answer performance of users using their previous activity in 
the forum [6]. The possibility of this model to predict poor 
answers even before they are provided could be used to help to 
reduce the frequency of poor answers within SO. In this new 
study, our goal is to predict helpers who are likely to provide 
answers to users’ questions quickly (“just-in-time”). We also aim 
to determine how much information about the user is sufficient to 
predict the helper (to deal with issues such as those raised by Kay 
and Kummerfeld [7] about how much information must be 
usefully retained about the user in lifelong learning contexts). 
Finally, we compare the results from this study with the topic 
modelling approach used by Tian et al. [9]. We hope this study 
will augment such studies as [3, 4, 8, 10] in providing peer helper 
seeking strategies that scale to very large numbers of users. 


2. RELATED WORK 


In supporting learners in computerized learning environments 
human helpers and intelligent agents have been employed. 
Merrill et al. [8] compared the help provided by peer helpers with
that provided by intelligent agents and conclusions from this study 
show that human helpers provide more flexible and subtle help. 
Similarly, Greer et al. [4], building on earlier work in finding peer 
helpers in workplace environments [3], built the iHelp system to 
help computer science students find potential peer helpers among 
their classmates who are ready, willing and able to help in 
overcoming impasses. In addition, Vassileva et al. [10] in their
study with iHelp incorporated the social characteristics of the
helper into determining an appropriate helper, gleaned from the
online activities of the helper such as votes received by the helper,
questions asked, answers provided, and the marks received on
assignments.


¹ We will use the term “user” in this paper rather than “learner”
when specifically discussing SO users since they are likely not
explicitly learners in their own minds. However, in the future
most professionals will be using such forums to meet their
lifelong learning goals. The term “learner” then will be highly
appropriate. Since our research is aimed at helping develop
tools for such professional lifelong learners, especially tools that
support personalization to each such learner, it is, we believe,
deeply and broadly relevant to advanced learning technology.


While these studies [3,4,10] have all successfully recommended 
just-in-time helpers for a relatively small number of students 
within classroom and workplace settings, in a typical question and 
answer forum, the number of users ranges from thousands to 
millions of users with more varied knowledge interests [5]. The 
sustainability of such a large-scale question and answer forum is 
dependent on providing quick responses to questions [2]. A study 
by Bhat et al. [2] reveals that in Stack Overflow, although most of 
the questions get answered in less than 1 hour, about 30% of the 
questions have a response time of 1 day with about 344,000
questions having a response time greater than 1 day. In addressing
the increasing number of unanswered questions, Bhat et al. [2] 
revealed the importance of assigning appropriate tags to 
questions; Asaduzzaman et al. [1] predicted how long a question 
will remain unanswered; and Tian et al. [9] predicted the best 
answerers to questions using a topic modelling approach. Yang 
and Manandhar [11] identified the topic modelling approach as a 
less effective approach that is too general while the use of 
question tags was proposed as a more informative approach. The 
study by Tian et al. [9] in predicting best answerers achieved a 
success rate of 21.5% while recommending 100 users who could 
answer the question. This reveals the need to explore other 
methodologies in predicting best answerers to questions. 


3. ANALYSIS OF QUESTION RESPONSE 
TIME AND UNANSWERED QUESTIONS IN 
STACK OVERFLOW 


SO is a question and answer forum that provides a platform to 
support millions of programmers by providing opportunities for 
them to ask questions and obtain answers from peers [5]. In cases 
where users do not receive answers from their peers, the user
could provide answers to their own questions or sometimes, the 
questions remain unanswered. Key to the success of such a forum 
is the ability of users to receive prompt answers to their questions 
[2]. We studied the answer response time of questions in SO from 
January 2009 to December 2015, the distribution of questions 
answered by question askers themselves, and the proportion of 
unanswered questions. We defined the answer response time as
the time difference between when a question is asked and
when it receives the first answer. Figure 1 shows the answer
response time of questions for each of 6 defined time intervals 
(within 15 minutes, within 1 hour, within 1 day, within 1 week, 
within 1 month and over a month) for each year under 
consideration. 
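
The six response-time intervals can be sketched as a simple bucketing function. This is an illustration only: treating a month as 30 days and treating the interval boundaries as inclusive are assumptions, and the `response_bucket` name is hypothetical.

```python
from datetime import datetime, timedelta

# Upper limits of the first five intervals; anything longer is "over a month".
# A month is assumed to be 30 days for this sketch.
BUCKETS = [
    (timedelta(minutes=15), "within 15 minutes"),
    (timedelta(hours=1), "within 1 hour"),
    (timedelta(days=1), "within 1 day"),
    (timedelta(weeks=1), "within 1 week"),
    (timedelta(days=30), "within 1 month"),
]

def response_bucket(asked_at, first_answer_at):
    """Map the time between question creation and first answer to an interval label."""
    delta = first_answer_at - asked_at
    for limit, label in BUCKETS:
        if delta <= limit:
            return label
    return "over a month"

asked = datetime(2015, 1, 1, 12, 0)
quick = response_bucket(asked, asked + timedelta(minutes=10))
slow = response_bucket(asked, asked + timedelta(days=40))
```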


Figure 1 shows that the majority of questions in SO get answered 
within 15 minutes, although we also observe a continuous 
decrease over time in the percentage of questions answered within 
15 minutes. In fact, in 2015 just 36% of the questions were 
answered within 15 minutes compared to 2009 when about 57% 
of the questions were answered within 15 minutes. Also, 
questions with response times above 15 minutes have continually 
increased. Some of the questions which received late
answers were actually answered by the question askers
themselves. Specifically, the total number of questions in this
category has increased from 1,946 in 2009 to 18,479 in 2015, as
shown in Table 1. Moreover, some questions never get
answered. Figure 2 shows a rapid growth in the number of
unanswered questions.


Figure 1: Response Time between Question Creation Date and
First Answer Creation Date


Table 1: Questions Answered by the Question Asker


Figure 2. Number of Unanswered Questions


While this growth is partly a result of an increase in the number of 
questions asked in SO, we believe a growth from 1,541 in 2009 to 
324,643 in 2015 is worth addressing. Moreover, Asaduzzaman et al.
[1] identified that the inability of questions to attract expert users
is one of the main reasons they remain unanswered. Of course, not
receiving answers to questions or having to answer your own
question yourself could deter the user from subsequently using the 
forum. The goal of our research is to support users who depend on 
online forums to receive answers to their questions. We believe 
the ability to predict prospective answerers for questions is the 
first step at supporting users to achieve this goal. 


4. RANKING STRATEGIES 


Results in section 3 suggest the need to support users in question 
and answer forums with the aim of decreasing the answer 
response time to questions. Our study seeks to predict such 
potential just-in-time peer helpers using 5 strategies for choosing 
such a helper. Each of these strategies considers the relevance of 
the question to online activities and the demonstrated knowledge 
in answers of the potential helpers (other users) in the past (we 
defined this by the co-occurrence of tags contained in the question 
with tags contained in the answers provided by the potential 
helper in the past). For each proposed strategy, personalized 
scores are assigned to each prospective helper based on their 
suitability to answer a question, as described below. 


4.1 Frequency 

The frequency strategy measures how frequently the prospective 
helper has answered questions relevant to a particular question 
under consideration in the past. The higher the frequency of 
interaction with relevant questions in the past, the more likely the 
user would be to answer the question. The frequency score was 
computed by counting the number of answer posts A relevant to 
the question tag(i) for user u as shown in equation 1 below:


$$\mathrm{Score}^{freq}_{u} = \sum A(i)_{u} \tag{1}$$


The prospective helpers with higher scores are ranked as better 
helpers based on this strategy. 
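
A minimal sketch of the frequency strategy follows. It assumes that an answer is relevant when it shares at least one tag with the question, and the record layout (`user`, `tags`) and the user names are hypothetical.

```python
def frequency_score(answers, user, question_tags):
    """Count past answers by `user` that share a tag with the question (equation 1)."""
    tags = set(question_tags)
    return sum(1 for a in answers if a["user"] == user and tags & set(a["tags"]))

# Hypothetical answer history
past_answers = [
    {"user": "ann", "tags": ["java", "swing"]},
    {"user": "ann", "tags": ["python"]},
    {"user": "bob", "tags": ["java"]},
    {"user": "ann", "tags": ["java"]},
]
scores = {u: frequency_score(past_answers, u, ["java"]) for u in ("ann", "bob")}
# Higher scores rank first
ranking = sorted(scores, key=scores.get, reverse=True)
```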


4.2 Knowledgeability 

Knowledgeability shows how much a prospective helper knows 
about the question based on the number of up votes the user has 
earned in answering past questions with the same tag (in SO 
questions and answers are voted upon to show how useful and 
appropriate they are). This is computed as shown in equation 2 
below: 


$$\mathrm{Score}^{know}_{u} = \sum \mathrm{Upvotes}\left(A(i)_{u}\right) \tag{2}$$


that is, the sum of all upvotes to answer posts A relevant to
question tag(i) for user u. Prospective helpers with a higher
number of up votes would be ranked as better based on this
strategy.
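
The knowledgeability score can be sketched in the same style. As before, relevance is assumed to mean sharing at least one tag with the question, and the `upvotes` field and sample records are hypothetical.

```python
def knowledgeability_score(answers, user, question_tags):
    """Sum the upvotes on the user's past answers relevant to the question (equation 2)."""
    tags = set(question_tags)
    return sum(
        a["upvotes"]
        for a in answers
        if a["user"] == user and tags & set(a["tags"])
    )

# Hypothetical answer history with upvote counts
past_answers = [
    {"user": "ann", "tags": ["java", "swing"], "upvotes": 3},
    {"user": "ann", "tags": ["python"], "upvotes": 5},
    {"user": "bob", "tags": ["java"], "upvotes": 1},
    {"user": "ann", "tags": ["java"], "upvotes": 2},
]
know_ann = knowledgeability_score(past_answers, "ann", ["java"])
know_bob = knowledgeability_score(past_answers, "bob", ["java"])
```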


4.3 Eagerness 

Eagerness is based on monitoring the online activity of a 
prospective helper as depicted by the proportion of answers they 
have provided in the past relevant to the question compared to the 
total number of answers provided by the user to all questions, as 
shown in equation 3 below. The eagerness measure depicts the 
probability that a user will answer a question related to tag (i): 


$$\mathrm{Score}^{eager}_{u} = \frac{\mathrm{Score}^{freq}_{u}}{N^{a}_{u}} \tag{3}$$


$N^{a}_{u}$ represents the total number of answers provided by the user to
all questions. This strategy seeks to measure the interest of the 
user in answering questions related to tag(i) by considering the 
proportion of relevant questions answered. We assume that users 
will provide more answers to questions they are more interested 
in; therefore the higher the proportion of relevant questions 


answered, the higher the likelihood the helper would be interested 
in answering the particular question under consideration. 
Prospective helpers with higher scores are ranked higher. 
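
The eagerness ratio can be sketched as below. The data layout is hypothetical, and returning 0.0 for a user with no answers is an assumption made to avoid division by zero.

```python
def eagerness_score(answers, user, question_tags):
    """Share of the user's past answers that are relevant to the question (equation 3)."""
    tags = set(question_tags)
    by_user = [a for a in answers if a["user"] == user]
    if not by_user:
        return 0.0  # assumption: users with no history get the lowest score
    relevant = sum(1 for a in by_user if tags & set(a["tags"]))
    return relevant / len(by_user)

# Hypothetical answer history
past_answers = [
    {"user": "ann", "tags": ["java", "swing"]},
    {"user": "ann", "tags": ["python"]},
    {"user": "bob", "tags": ["java"]},
    {"user": "ann", "tags": ["java"]},
]
eager_ann = eagerness_score(past_answers, "ann", ["java"])
eager_bob = eagerness_score(past_answers, "bob", ["java"])
```

In this toy history the user with a single relevant answer outranks the more active user, which shows how eagerness differs from raw frequency.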


4.4 Willingness 


This measure is a combination of how active and eager the user 
has been in answering questions related to the question tag in the 
past. That is, a user who is eager to answer questions like the 
question under consideration and has answered such questions a 
lot should be more willing to answer the question under 
consideration. The Bayes theorem is applied in computing this 
peer matching measure as shown in equation (4) below: 


$$P(U_{u} \mid tag(i)) = \frac{P(tag(i) \mid U_{u}) \cdot P(U_{u})}{P(tag(i))} \tag{4}$$


where $P(tag(i) \mid U_{u})$ is the likelihood that an answer to a question
related to tag(i) will be given by user u, which is computed as
shown in equation (4a) below:


$$P(tag(i) \mid U_{u}) = \frac{\mathrm{Score}^{freq}_{u}}{N(i)_{a}} \tag{4a}$$


$N(i)_{a}$ represents the total number of answers provided to tag(i)
by all users. $P(U_{u})$ is the prior probability of a user u answering a
question related to tag(i) which is equivalent to the eagerness of 
the user as computed in equation (3) above. P(tag(i)) is the 
probability that a question related to tag(i) will be asked (this is 
the same for all prospective helpers). To maximize the posterior 
probability as shown in equation (4), the numerator is maximized 
since the denominator is common to all the prospective helpers. 
The willingness score is therefore computed as shown in equation 
(4b) below (we substituted values from equation (4a) and (3) into 
equation (4)): 


$$\mathrm{Score}^{will}_{u} = \frac{\mathrm{Score}^{freq}_{u}}{N(i)_{a}} \times \mathrm{Score}^{eager}_{u} \tag{4b}$$


Prospective helpers with higher willingness score are ranked 
higher. 
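
The willingness score combines the two ratios described above. The sketch below is one way to compute it under the same assumptions as before (relevance means sharing at least one tag; the record layout is hypothetical; empty histories score 0.0).

```python
def willingness_score(answers, user, question_tags):
    """Product of the likelihood (equation 4a) and the eagerness (equation 3)."""
    tags = set(question_tags)
    relevant = [a for a in answers if tags & set(a["tags"])]  # answers by all users
    by_user = [a for a in answers if a["user"] == user]
    if not relevant or not by_user:
        return 0.0  # assumption: no evidence means the lowest score
    freq = sum(1 for a in relevant if a["user"] == user)
    likelihood = freq / len(relevant)  # share of all relevant answers given by the user
    eagerness = freq / len(by_user)    # share of the user's answers that are relevant
    return likelihood * eagerness

# Hypothetical answer history
past_answers = [
    {"user": "ann", "tags": ["java", "swing"]},
    {"user": "ann", "tags": ["python"]},
    {"user": "bob", "tags": ["java"]},
    {"user": "ann", "tags": ["java"]},
]
will_ann = willingness_score(past_answers, "ann", ["java"])
will_bob = willingness_score(past_answers, "bob", ["java"])
```

The denominator P(tag(i)) is omitted because it is the same for every prospective helper and therefore does not change the ranking.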


4.5 Recency 

The recency strategy corresponds to how actively and recently the 
prospective helper has provided answers to relevant questions. 
The recency score is computed for each prospective helper based 
on the timestamp of the latest answer A provided relevant to the 
question tag(i) as shown in equation 5 below: 


$$\mathrm{Score}^{rec}_{u} = \mathrm{latest}\left(\mathrm{Time}\, A(i)_{u}\right) \tag{5}$$


This simply means that the recency score for a user u who has 
provided answers A to questions with tag(i) will be the 
timestamp of their latest answer (the maximum time). Under this 
measure prospective helpers who have answered related questions 
more recently would be ranked higher than those who answered 
such questions earlier. As the interests of potential helpers could 
evolve [5], providing answers to relevant questions in recent times 
could imply the prospective helper is still interested in answering 
questions related to the question tags. Although Greer et al. [4]
argued that helpers who have recently provided help should be
exempt to avoid overworking a peer helper, in SO this might not
hold, as users might still be willing to provide help with the
goal of earning some incentive from the forum (this could be the
earning of a reputation score or of various badges).
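
Because the recency score is a timestamp rather than a count, ranking amounts to sorting users by their latest relevant answer. The sketch below assumes hypothetical records with a `time` field and drops users who have no relevant answer.

```python
from datetime import datetime

def recency_score(answers, user, question_tags):
    """Timestamp of the user's latest answer relevant to the question (equation 5)."""
    tags = set(question_tags)
    times = [a["time"] for a in answers if a["user"] == user and tags & set(a["tags"])]
    return max(times) if times else None

def rank_by_recency(answers, users, question_tags):
    """Order users from most to least recent relevant answer."""
    scored = [(u, recency_score(answers, u, question_tags)) for u in users]
    scored = [(u, t) for u, t in scored if t is not None]
    return [u for u, _ in sorted(scored, key=lambda pair: pair[1], reverse=True)]

# Hypothetical answer history
past_answers = [
    {"user": "ann", "tags": ["java"], "time": datetime(2015, 3, 1, 9, 0)},
    {"user": "bob", "tags": ["java"], "time": datetime(2015, 3, 5, 18, 30)},
    {"user": "ann", "tags": ["python"], "time": datetime(2015, 3, 9, 8, 0)},
]
recency_ranking = rank_by_recency(past_answers, ["ann", "bob", "cy"], ["java"])
```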


5. EXPERIMENTAL EVALUATION AND
RESULTS


The goal of our study is to explore the effectiveness of different 
peer-helper matching strategies in terms of their ability to predict 
a relevant peer-helper who will provide quick answers. For each 
of the strategies described in section 4, we evaluated their 
effectiveness using the historical SO data of each prospective 
helper going back 1 month, 3 months and 6 months from the time 
a question was asked. For this study we only focused on java²
questions (53,731 of them) that received at least one answer
within the first hour of creation with 254,766 prospective helpers
to choose from. These represent questions that were answered
fairly promptly, which we feel provides a good basis
for evaluating the effectiveness of the various strategies in
predicting the just-in-time answerers. Likewise, we regarded only
users who were available online within the first hour the question
was created to be users who would be prospective helpers, as in a
real life situation they are the set of users who are more likely to
view the questions earlier and provide quicker response. Also, we
employed the one hour time frame in defining the online users as 
it aligns with the time frame of the questions considered in this 
study. 


We also need a success measure for our predictions. Similar to the 
study by Tian et al. [9], we deem it a success if a user in the top N 
ranked users computed by a strategy is also a user who actually 
answered the question under consideration in SO. The success rate 
S@N for each strategy can then be computed by dividing the total 
number of successes by the total number of questions as shown in 
equation 6 below. 


$$S@N = \frac{\text{Total Number of Successes}}{\text{Total Number of Questions}} \times 100\% \tag{6}$$


We can use different values of N to get a glimpse into how our 
prediction would perform as the number of prospective helpers 
predicted increases. In our study we used N = 1, 5, 10, and 20. 
Finally, we wanted to compare the effectiveness of our strategies 
in three different prediction criteria: predicting the answerer who 
responded first in SO, predicting the answerer who gave the best 
answer according to the user who asked the question, and 
predicting the answerer whose answer other SO users ranked as 
having the best score. 
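
The success rate S@N can be sketched as follows. The dictionaries mapping question ids to ranked users and to the target answerers are hypothetical structures; for the first-answerer criterion the target set would contain a single user.

```python
def success_at_n(ranked_lists, target_users, n):
    """Percentage of questions with a target answerer among the top n ranked users."""
    if not ranked_lists:
        raise ValueError("at least one question is required")
    hits = sum(
        1
        for question_id, ranked in ranked_lists.items()
        if set(ranked[:n]) & target_users[question_id]
    )
    return 100.0 * hits / len(ranked_lists)

# Hypothetical rankings produced by one strategy, and the users to be predicted
ranked_lists = {"q1": ["ann", "bob", "cy"], "q2": ["bob", "cy", "ann"]}
target_users = {"q1": {"bob"}, "q2": {"ann"}}
rates = {n: success_at_n(ranked_lists, target_users, n) for n in (1, 2, 3)}
```

Since the top n users are always contained in the top n + 1, the rate cannot decrease as N grows.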


Predicting the first answerer: This criterion evaluates 
the ranked list of prospective helpers predicted for each of the 
strategies with the aim to know their effectiveness at predicting 
the user who will first provide an answer to the question. The 
results in table 2 show that considering the willingness of a 
prospective helper has the highest success rate of 55.86% with 
S@20 using a time frame of 6 months.


Predicting the best answerer: In SO, from the 
numerous answers provided to a question, the question asker can 
mark only one of the answers as accepted which indicates the best 
answer according to the asker [9]. The goal of this evaluation 
criteria is to determine the success of the measures at identifying 
the best answerer from the ranked list of prospective helpers 
suggested. The results are shown in table 3 below. As in
predicting the first answerer, we observed that the willingness
peer matching strategy has the highest success rate of 54.62%
with S@20 using the 6 months defined time line.


² We focused on questions containing java tags as this is the most
used programming related tag in SO.


Predicting the answerer with the highest score: 
Other community (SO) members also have the privilege to vote 
on the answers provided if they wish. In some cases the answer 
voted as best by the question asker might not necessarily be the 
answer with the highest score according to the community. With 
this evaluation criterion we want to examine the effectiveness of 
the peer matching strategies at predicting the user with the highest 
score. Results from this evaluation are shown in table 4 below. 
Amongst the 5 strategies considered, again we observed that the
willingness of the prospective users has the highest success rate at
predicting the user who obtained the highest score, with a
success rate of 56% with S@20 using the 6 months defined time
line.


Overall, with the 3 evaluation criteria we achieved the highest 
success rate with the willingness measure and the least success 
with the recency strategy. Also, we observed that as the number of 
months increases from 1 to 6 months, we did not see any
tremendous difference in the success rate for all the strategies. 
Tables 2 - 4 show (unsurprisingly) that as N increases, the success 
rate of the prediction also increases. Comparing all 3 evaluation 
criteria, we achieved the highest success while predicting the user 
with the highest score, although the success rate obtained with the 
other criteria (i.e. predicting the first answerer and best answerer) 
did not differ significantly using S@20. In the next section, we 
show how we attempted to improve the performance of these 
strategies by including an additional measure called timeliness. 


6. PREDICTION OF JUST-IN-TIME 
HELPERS 


The main goal of this study is to predict helpers just-in-time, i.e.
helpers who would provide answers as quickly as possible.
Therefore we included a timeliness criterion that takes into 
consideration how quickly a prospective helper would provide an 
answer to a question. We used the 15 minutes time frame as it 
represents the average time in which most questions are answered 
(although, the percentage of questions answered within this time 
frame has decreased as shown in section 3). For each prospective 
helper, we computed the timeliness measure as shown in equation
(7):


$$\mathrm{Score}^{time}_{u} = \frac{N^{t<15}_{u}}{N^{a}_{u}} \tag{7}$$


$N^{t<15}_{u}$ represents the number of questions the user answered within
15 minutes in the past while $N^{a}_{u}$ represents the total number of
answers provided by user u. To see how well our various
strategies work in predicting such just-in-time helpers, we
multiplied the timeliness score $\mathrm{Score}^{time}_{u}$ obtained by each user by
their respective score on each of the other strategies except for the
recency strategy. We excluded the recency strategy in this
prediction as it is the weakest measure as shown in tables 2-4.
Moreover, the recency score computed as shown in equation 5 is a
timestamp value which cannot be multiplied by the timeliness
score as can the numeric values obtained with other strategies.
Finally, since we did not observe any major differences when we 
used the 1 month history data of the prospective helper as 
compared to the 6 month history, in predicting the just-in-time 
helpers we only employed the history data of the prospective 
answerers over the 1 month time frame. This also saved a lot of
computational time. The results obtained are shown in tables 5-7 
for each of the evaluation criteria. 
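
A minimal sketch of the timeliness weighting is given below. It assumes that response times are available in minutes, that "within 15 minutes" includes exactly 15 minutes, and that a user with no answers scores 0.0; the function names are hypothetical.

```python
def timeliness_score(response_minutes, limit=15):
    """Share of a user's past answers that were given within `limit` minutes."""
    if not response_minutes:
        return 0.0  # assumption: no history means the lowest score
    fast = sum(1 for m in response_minutes if m <= limit)
    return fast / len(response_minutes)

def timely_strategy_score(strategy_score, response_minutes):
    """Weight a numeric strategy score (e.g. frequency) by the timeliness score."""
    return strategy_score * timeliness_score(response_minutes)

# Hypothetical user: four past answers, two of them within 15 minutes
timeliness = timeliness_score([5, 20, 10, 60])
combined = timely_strategy_score(4, [5, 20, 10, 60])
```

The product only makes sense for numeric scores, which is why a timestamp-valued score such as recency cannot be combined this way.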
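

Table 2: Success Rate at Predicting the First Answerer


1 Month          | S@1 | S@5    | S@10   | S@20
frequency        |     | 18.87% | 31.65% | 49.13%
recency          |     | 131%   | 20.30% | 33.60%
eagerness        |     | 9.89%  | 21.29% | 43.57%
knowledgeability |     | 17.97% | 28.10% | 39.52%
willingness      |     | 21.06% | 35.89% |

3 Months         | S@1 | S@5    | S@10   | S@20
frequency        |     | 18.93% | 31.37% | 48.23%
recency          |     | 11.67% | 20.67% | 33.96%
eagerness        |     | 10.09% | 21.53% | 43.82%
knowledgeability |     | 17.85% | 28.05% | 39.32%
willingness      |     | 21.11% | 35.35% |

6 Months         | S@1 | S@5    | S@10   | S@20
frequency        |     | 20.00% | 33.13% | 50.81%
recency          |     | 12.66% | 21.81% | 35.59%
eagerness        |     | 10.32% | 23.15% | 47.00%
knowledgeability |     | 19.03% |        | 41.94%
willingness      |     | 22.43% |        |


Table 3: Success Rate at Predicting the Best Answerer


1 Month          | S@1 | S@5    | S@10   | S@20
frequency        |     | 19.60% | 31.84% | 48.25%
recency          |     | 12.20% | 21.19% | 33.91%
eagerness        |     | 9.36%  | 19.98% | 41.03%
knowledgeability |     | 19.18% | 29.24% | 40.66%
willingness      |     | 21.40% | 35.40% |

3 Months         | S@1 | S@5    | S@10   | S@20
frequency        |     | 19.58% | 31.72% | 47.26%
recency          |     | 12.55% | 21.48% | 34.35%
eagerness        |     | 9.76%  | 20.69% | 41.43%
knowledgeability |     | 18.99% | 29.27% | 40.66%
willingness      |     | 21.30% | 35.08% |

6 Months         | S@1 | S@5    | S@10   | S@20
frequency        |     | 20.78% | 33.34% | 50.26%
recency          |     | 13.70% | 22.84% | 36.12%
eagerness        |     | 9.90%  | 22.06% | 44.61%
knowledgeability |     | 20.33% | 31.22% | 43.54%
willingness      |     | 22.80% | 37.29% |


Table 4: Success Rate at Predicting the Answerer with the
Highest Score


1 Month          | S@1 | S@5    | S@10   | S@20
frequency        |     | 19.96% | 32.48% | 49.38%
recency          |     | 12.26% | 21.62% | 34.90%
eagerness        |     | 9.30%  | 20.09% | 42.12%
knowledgeability |     | 19.99% | 30.29% | 41.73%
willingness      |     | 21.71% | 36.23% |

3 Months         | S@1 | S@5    | S@10   | S@20
frequency        |     | 20.28% | 32.46% | 48.43%
recency          |     | 12.92% | 22.11% | 35.33%
eagerness        |     | 9.96%  | 21.18% | 42.88%
knowledgeability |     | 19.92% | 30.36% | 41.80%
willingness      |     | 21.89% | 36.09% |

6 Months         | S@1 | S@5    | S@10   | S@20
frequency        |     | 21.49% | 34.34% | 51.37%
recency          |     | 14.11% | 23.40% | 37.13%
eagerness        |     | 10.19% | 22.61% | 46.05%
knowledgeability |     | 21.16% | 32.32% | 44.63%
willingness      |     | 23.45% | 38.52% |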


Table 5. Timeliness Success at Predicting the First Answerer 


First Answerer, 1 Month
Timeliness       | S@1 | S@5    | S@10   | S@20
frequency        |     | 21.86% | 36.16% |
eagerness        |     | 26.71% | 43.31% |
knowledgeability |     | 20.10% | 30.45% |
willingness      |     | 24.89% | 40.55% |


41.54% 
60.34% 


Table 6. Timeliness Success at Predicting the Best Answerer 


Best Answerer, 1 Month
Timeliness       | S@1 | S@5    | S@10   | S@20
frequency        |     | 20.95% | 34.09% | 50.84%
eagerness        |     | 20.95% | 35.27% | 53.76%
knowledgeability |     | 20.19% | 30.38% | 41.45%
willingness      |     | 23.64% | 37.91% |


Table 7. Timeliness Success at Predicting the Answerer with 
the Highest Score 


Highest Score, 1 Month
Timeliness       | S@1 | S@5    | S@10   | S@20
frequency        |     | 21.46% | 34.98% | 51.94%
eagerness        |     | 21.47% | 36.47% | 55.34%
knowledgeability |     | 21.06% | 31.38% | 42.54%
willingness      |     | 24.30% | 38.93% |


7. DISCUSSION 


The aim of our research is to support lifelong learners as they 
interact with peers in open ended learning environments like SO. 
As lifelong learners are responsible for their own learning [7], 
millions of them depend on such learning forums to meet their 
learning needs on a daily basis. Obtaining timely answers to 
questions is important [2] in supporting lifelong learners and in 
enhancing the sustainability of such an online learning 
community. However, we observed (as shown in section 3) that
the answer response times to questions have increased and in 
some cases the question askers have to answer their own questions 
themselves, which can deter the lifelong learner. In this study, we 
address this problem by predicting prospective users who are 
likely to provide the most timely answers to their question. 


Previous studies by Greer et al. [3, 4] and Vassileva et al. [10] 
have identified the various strategies that could be used in 
predicting the prospective helpers within the classroom and 
workplace learning environments. In this study we explored the 
effectiveness of the various strategies at predicting prospective 
helpers in SO, an environment with vastly more learners seeking 
answers to their questions than in academic classes. We achieved 
the highest success rate S@20 of 54.20% using the 1 month time
line with the willingness strategy. Also, with the recency measure
performing the poorest amongst all the measures defined, our
study affirms the claim by Greer et al. [4] that helpers who have
recently provided help would be less likely to provide answers
and should be exempted to avoid overworking a peer helper.


We improved upon the results obtained from each of the strategies 
described in section 4, by including an additional criterion called 
timeliness. This criterion takes into consideration the probability
that a user would answer a question quickly. We achieved a
maximum success rate S@20 of 63.15% (eagerness), 55.34% 
(willingness) and 56.65% (willingness) in predicting, respectively, 
the first answerer, the best answerer, and the answerer who will 
provide the highest score. These values represent an improvement
in the success rate from 43.57% to 63.15% (eagerness), 52.47% to
55.34% (willingness), and 53.63% to 56.65% (willingness) in
predicting the first answerer, the best answerer, and the answerer
who will provide the highest score, respectively, using the 1 month
time frame (comparing our results from tables 2-4 with the results
obtained in tables 5-7). While these results leave room for
improvement, they already improve on the previous work by Tian
et al. [9], who obtained a success rate S@20 of 12.57% and
S@100 of 23.06% while predicting the best answerer using the
topic modelling approach. We believe the results obtained in this
study for all the strategies defined outperform this previous work.
The variation in our results from those of Tian et al. is presumably
because our study was restricted to questions that were answered
promptly (i.e., questions with at least one answerer within the
first hour after the question was created). We focused on this set
of questions because the goal of our study is to predict the
just-in-time helpers who will provide quick answers to the
questions; questions answered late would therefore not
suffice. Although Yang and Manandhar [11] argued for the use of
the topic modelling approach in predicting the best answerer, our
results suggest that this is a less informative approach.
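
The S@N figures quoted above can be read with the help of a small sketch. It assumes the usual definition of success at N: a question counts as a hit when its actual answerer appears among the top N predicted helpers. This section does not restate the definition, so both that reading and the data layout below (the dictionaries `ranked_candidates` and `actual_answerers`) are assumptions made for illustration.

```python
def success_at_n(ranked_candidates, actual_answerers, n=20):
    """Fraction of questions whose actual answerer is in the top-n ranked candidates.

    ranked_candidates: dict question id -> list of user ids, best first (hypothetical layout)
    actual_answerers:  dict question id -> user id of the answerer being predicted
    """
    if not actual_answerers:
        return 0.0
    hits = sum(
        1
        for qid, user in actual_answerers.items()
        if user in ranked_candidates.get(qid, [])[:n]
    )
    return hits / len(actual_answerers)
```

Under this reading, an S@20 of 63.15% would mean that the first answerer was among the 20 highest-ranked candidates for roughly 63 of every 100 questions.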


For each of the peer matching strategies, we also studied their 
performance in predicting the relevant peer helpers using the 
history data for prospective peer helpers for the periods of 1 
month, 3 months and 6 months. Our aim is to understand the 
trade-off between using older and newer data about the user. As Kay
and Kummerfield [7] already identified, there is a trade-off
between the usefulness of retaining older information about the 
lifelong learner and preserving only the recent data. Our results 
show that employing older information (6 months) about the 
learner was at best only marginally better when compared to the 
results achieved with the newer information (1 month). This 
confirms an earlier study [5] we did in predicting (again in SO) 
what the user would want to learn in the future, where we showed 
that employing shorter term information about the user’s past
behavior proved more effective in predicting what the user would
be learning in the future.


While we feel that we have achieved good prediction accuracy 
with our strategies (especially as compared to other studies), we 
would still like to enhance the accuracy to ensure the usefulness 
of our strategies in a real learning environment. So, in our next 
experiment, we aim to further improve on our results, pushing 
them well above our current success rates if we can. Our aim will 
be to develop new strategies that can identify users who would 
have been likely to help answer the question quickly. Overall, we 
feel this research is a promising first step for being able to show 
how we can find good peer helpers to help professional lifelong 
learners who are keeping themselves up-to-date through 
interactions with their peers in online forums.

ABSTRACT


This study examined how machine learning and natural language 
processing (NLP) techniques can be leveraged to assess the 
interpretive behavior that is required for successful literary text 
comprehension. We compared the accuracy of seven different 
machine learning classification algorithms in predicting human 
ratings of student essays about literary works. Three types of NLP
feature sets, namely unigrams (single content words), elaborative
(new) n-grams, and linguistic features, were used to classify idea
units (paraphrase, text-based inference, interpretive inference). The
most accurate classifications emerged using all three NLP feature
sets in combination, with accuracy ranging from 0.61 to 0.94 (F=0.18
to 0.81). Random Forests, which employs multiple decision trees 
and a bagging approach, was the most accurate classifier for these 
data. In contrast, the single classifier, Trees, which tends to 
“overfit” the data during training, was the least accurate. Ensemble 
classifiers were generally more accurate than single classifiers. 
However, the accuracy of Support Vector Machines was comparable to
that of the ensemble classifiers. This is likely due to the ability of
Support Vector Machines to handle high-dimensional feature spaces.
The findings suggest that combining the power of NLP and 
machine learning is an effective means of automating literary text 
comprehension assessment. 


Keywords 


Natural language processing; supervised machine learning; 
classification; interpretation 


1. INTRODUCTION 


Text comprehension researchers employ a variety of methods to 
assess how people process and understand the things that they read. 
The majority of this work has focused on how readers comprehend 
expository or informational texts (e.g., science textbooks or 
historical accounts) and simple narratives (e.g., brief plot-based 
texts). Much less work has been done to investigate the kinds of 
processes that occur when readers read literary texts, such as the 
poems, short stories, and novels assigned in English-Language Arts 
classrooms [1]. More so than in other text domains, literary text 
comprehension requires the construction of interpretations that go 
beyond the literal story to speak to a deeper meaning about the 
world at large [2]. 


In order to measure interpretation and assess literary
comprehension, researchers have relied on collecting students’
essays about the text. The essay can then be scored in a variety of 
ways to address different questions about the comprehension 
process [3]. Unfortunately, reliably evaluating essays is both time 
and resource intensive. In other text domains, researchers have 
begun to develop natural language processing (NLP) tools to
automate this scoring [4,5]. With this in mind, our goal was to
develop a means of automatically assessing students’ essays about
literary texts, with particular attention to readers’ interpretation
of a text’s potential deeper meaning.


Our purpose was to investigate if NLP and machine learning could 
be combined and leveraged to accurately predict human ratings of 
students’ essays. We drew upon existing text comprehension 
research to identify and extract three NLP feature sets that were 
relevant to literary text comprehension. These feature sets were 
used to compare seven machine learning classification algorithms 
in their ability to classify idea units in student essays as literal 
(paraphrase or text-based inferences) or interpretive. 


1.1 Text Comprehension 

The field of text comprehension investigates the complex activities 
involved in how people read, process, and understand text. As 
people read, they generate a mental representation, or mental 
model. The quality, structure, and durability of this representation 
reflect the reader’s comprehension of the text [6,7]. A critical 
aspect of this mental representation is the inclusion of inferences. 
Inferences connect different parts of the text or connect information 
from the text to information from prior knowledge. Those who 
generate more inferences have a more elaborated mental 
representation [6,7]. Importantly, different types of texts and tasks 
afford different amounts and types of inferences [8]. For example, 
readers studying for an upcoming test generate explanatory and 
predictive inferences, whereas readers reading for fun generate 
personal association inferences. These different types of inferences 
suggest readers are engaging in different processes and are 
constructing different mental representations of the text [9]. Given 
the importance of inferences in successful text comprehension, a 
majority of text research is aimed at understanding when and how 
inferences are constructed [10]. 


1.2 Literary Comprehension

In the study of literary text comprehension, researchers are 
interested in interpretive inferences. Interpretive inferences reflect 
a representation of the author’s message or deeper meaning [11]. 
Take, for example, the story of the Tortoise and the Hare. A reader
may make text-based inferences to maintain a coherent
representation of the events of the text. A reader might generate the
inference The tortoise was able to pass the hare because the hare
was sleeping to explain why the slow tortoise was able to beat the
speedy hare. In contrast, a reader might generate an interpretive
inference that goes beyond the story world to address the moral or
message of the story, such as It is better for someone to be
perseverant than talented. Research indicates that expert literary
readers (e.g., English Department faculty or graduate students) 
allocate more effort to generating interpretive inferences, whereas
novices, who tend to have less domain-specific reading goals and
strategies, tend to merely paraphrase, or restate the plot.


Notably, there is no one “right” interpretation, but rather a 
multitude of possibilities that may be more or less supportable by 
the evidence in the text [11,12]. Indeed, some might argue that the 
moral of the Tortoise and the Hare is not about the tortoise’s 
achievement, but instead reflects a cautionary message about the 
hare’s behavior, such as People should not be over-confident. As 
such, assessing interpretation is more difficult than evaluation of 
performance in well-defined domains that have a single correct 
answer. To capture and assess interpretations, researchers have 
relied on open-ended measures, such as think-aloud protocols, in 
which readers talk aloud about their processing as they read through 
the text [13,14,15], and post-reading essays in which
students construct responses to various writing prompts [16]. The 
transcribed think-aloud data and essays are then parsed into 
sentences or idea units and scored for the kinds of paraphrases and 
inferences present. In order to reliably categorize the idea units and 
essay quality, experts develop and refine a codebook that is then 
used to train raters. These raters work both independently and 
collaboratively to reach a satisfactory metric of reliability, such as 
percent agreement or intra-class correlation. 


1.3 Natural Language Processing

More recently, a push has been made to incorporate NLP in text 
comprehension research [17]. Linguistic features from existing 
texts are extracted using NLP tools [18]. These tools draw upon 
corpora of large sets of texts and human ratings to measure aspects 
of language, such as word overlap, semantic similarity, and 
cohesion. NLP tools can be used to identify and measure linguistic 
features that reliably predict human essay ratings [4]. 


2. DATA & METHODS
2.1 Corpus 


The corpus included 346 essays written by college students from 
two experiments investigating literary interpretation [16,19]. The 
essays were written about two different short stories from different 
literary genres (science-fiction, surrealist). In the behavioral 
experiments, participants received differing reading instructions 
and writing prompts that biased readers towards paraphrasing or 
interpretation. 


2.2 Human Ratings 

Four expert raters scored the set of essays using a previously 
developed codebook [16]. Essays were parsed into idea units (n = 
4,111) and each idea unit was labeled as verbatim, paraphrase, text- 
based inference, or interpretive inference (Table 1). Given the small
number of verbatim units, verbatim and paraphrase were collapsed
into a single paraphrase type.


2.3 Classification Algorithms 

Machine learning investigates how machines can automatically 
learn to make accurate predictions based on past observations. 
Classification is a form of machine learning that uses a supervised 
approach. In supervised machine learning, the model learns from a 
set of data with the class labels already assigned. The model uses 
this existing classification to make classifications on new data. 


Data classification consists of two steps: a learning step (or training
phase) and a classification step. In the learning step, a classification
algorithm builds a model by “learning from” a training set
composed of database tuples and their associated class labels. A
training set may be represented as $(X, Y)$, where each tuple $X_i$ is
an $n$-dimensional attribute vector, $X_i = (x_1, x_2, \ldots, x_n)$,
depicting $n$ measurements made on the tuple from $n$ database
attributes, $A_1, A_2, \ldots, A_n$, respectively. Each attribute
represents a ‘feature’ of $X$. Each $X_i$ belongs to a pre-defined
class label, represented as $Y_i$ [20].
In the classification step, the trained model is used to predict class
labels for a test set of new data that has not been used during
model training. This test data is used to determine the accuracy of
a classification algorithm, or classifier.
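
The two steps can be sketched in a few lines. The example below is only an illustration: it uses a nearest-centroid classifier, which is not one of the algorithms evaluated in this paper, because it keeps the learning step and the classification step easy to see. The function names and the hold-out split are assumptions introduced for the sketch.

```python
import random

def train_test_split(X, Y, test_fraction=0.25, seed=0):
    """Hold out part of the labelled tuples so accuracy is measured on unseen data."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_fraction))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [Y[i] for i in train],
            [X[i] for i in test], [Y[i] for i in test])

def fit_centroids(X, Y):
    """Learning step: compute one mean attribute vector (centroid) per class label."""
    sums, counts = {}, {}
    for x, y in zip(X, Y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(model, x):
    """Classification step: return the label whose centroid is closest to x."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

def accuracy(model, X, Y):
    """Share of tuples whose predicted label matches the assigned label."""
    return sum(predict(model, x) == y for x, y in zip(X, Y)) / len(Y)
```

A typical use is to call `train_test_split`, fit on the training part only, and report `accuracy` on the held-out part.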


Some of the most commonly used classification algorithms are 
Naive Bayes [21], Decision Trees [22], Maximum Entropy [23,24], 
Neural Networks [25], and Support Vector Machines [26,27]. In 
addition, researchers also employ ensemble techniques that use 
more than one of the classifying algorithms. These ensemble 
algorithms include Bagging [28], Boosting [29], Stacking [30], and 
Random Forests [31]. 


2.3.1 Naive Bayesian 

The Naive Bayesian algorithm is a probabilistic learning method
based on Bayes’ theorem of posterior probability. It assumes that
the effect of an attribute value on a given class is independent of
the values of the other attributes [21].
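
As a minimal sketch of the independence assumption, the following multinomial Naive Bayes classifier scores a class by adding the log prior to the sum of per-word log likelihoods. Add-one (Laplace) smoothing and the token-list input format are assumptions chosen for the illustration, not details taken from the paper.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """docs: list of token lists; labels: one class label per document."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(model, tokens):
    """Return the class with the highest posterior log score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                # Each word contributes independently (the "naive" assumption).
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```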


Table 1. Idea unit identification: Definitions and examples
(From McCarthy & Goldman, 2015)

Type: Verbatim
Description: Copied directly from the text
Example from Harrison Bergeron: The Handicapper General, came into the
studio with a double-barreled ten-gauge shotgun. She fired twice, and the
Emperor and the Empress were dead before they hit the floor.
Example from The Elephant: The schoolchildren who had witnessed the scene
in the zoo soon started neglecting their studies and turned into hooligans.
It is reported they drink liquor and break windows. And they no longer
believe in elephants.

Type: Paraphrase
Description: Rewording of the sentences from the text; Summary or
combining of multiple sentences from the text
Example from Harrison Bergeron: Then [Harrison] and the ballerina were
killed by Diana Moon Glampers, the Handicapper General.
Example from The Elephant: After seeing this the students gave up on
education became drunks and stopped believing in elephants.

Type: Text-Based Inference
Description: Reasoning based on information presented in the story, with
some use of prior knowledge; connecting information from two parts of
the text
Example from Harrison Bergeron: Diana Moon Glampers killed them because
they tried to show their true selves.
Example from The Elephant: After being deceived by the fake elephant, the
children became poor students, and grew up behaving badly because they
were lied to

Type: Interpretive Inference
Description: Inferences that reflect nonliteral, interpretive
interpretations of the text
Example from Harrison Bergeron: It shows what kind of a place the world
can turn out to be if we let [the government] get out of control.
Example from The Elephant: The theme is that being lied to ends the
innocence of the young boys and girls.


2.3.2 Decision Trees

The Decision Trees learning method approximates discrete-valued
target functions. The learned function is represented as a decision
tree, which can in turn be expressed as a set of if-then rules. Each
node in the tree specifies a test of some attribute, and each possible
value of that attribute corresponds to a branch descending from the
node. The attribute tested at a node is chosen using a statistical
property called information gain [22].
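
Information gain can be made concrete with a short sketch: it is the entropy of the class labels minus the weighted entropy that remains after splitting on an attribute. The row format (a dictionary of discrete attribute values) and the attribute names in the usage note are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in label entropy obtained by splitting the rows on one attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attributes):
    """Attribute a tree learner would test at this node under the gain criterion."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))
```

For example, an attribute such as `has_cue` that separates the classes perfectly yields the maximum gain, while an attribute unrelated to the labels yields a gain of zero.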


2.3.3 Maximum Entropy (MaxEnt)

MaxEnt models work on a simple principle: choose a model
that is consistent with all of the given facts. The models are based
on what is known, and do not make any assumptions about the
unknowns [23,24].


2.3.4 Neural Networks 


Neural Networks is a computational approach based on a collection 
of neural units. It is an attempt to model the information processing 
capabilities of the human nervous system. These models are self- 
learning, and use a back-propagation algorithm for updating the 
weights based on feedback [25,32]. 


2.3.5 Support Vector Machine (SVM) 


SVM constructs a hyperplane that separates the data into classes. 
SVMs are efficient for high-dimensional feature spaces and are 
among the best supervised learning algorithms [26,27]. 


2.3.6 Bagging 

Bagging (or Bootstrap Aggregation) is a meta-algorithm that
combines multiple classifiers. It creates bootstrap samples of a
training set using sampling with replacement. Bagging trains each
model in the ensemble on its own bootstrap sample, and performs
classification by majority voting among the trained classifiers [28].
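
The procedure can be sketched as follows. The base learner here is a one-nearest-neighbour classifier, chosen only because it fits in two short functions; any `fit`/`predict` pair could be substituted, and all names are assumptions made for the illustration.

```python
import random
from collections import Counter

def bootstrap_sample(X, Y, rng):
    """Draw len(X) training tuples with replacement."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [Y[i] for i in idx]

def bagging_fit(X, Y, fit, n_models=25, seed=0):
    """Train one model per bootstrap sample."""
    rng = random.Random(seed)
    return [fit(*bootstrap_sample(X, Y, rng)) for _ in range(n_models)]

def bagging_predict(models, predict, x):
    """Majority vote over the predictions of all trained models."""
    votes = Counter(predict(model, x) for model in models)
    return votes.most_common(1)[0][0]

def fit_1nn(X, Y):
    """Illustrative base learner: simply store the labelled tuples."""
    return list(zip(X, Y))

def predict_1nn(model, x):
    """Label of the stored tuple closest to x."""
    nearest = min(model, key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)))
    return nearest[1]
```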


2.3.7 Boosting 

Boosting is a meta-algorithm that incrementally builds an ensemble
by iteratively training weak learners or classifiers. While training
new models, it emphasizes instances that were misclassified by the
previous models. Thus, each model is trained on data weighted
according to the performance of the previous models. The final
result is the weighted sum of the results of all of the classifiers [29].
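
The re-weighting step can be illustrated with one round of an AdaBoost-style update. This is a sketch for the binary case under the assumption that the instance weights sum to one; the paper does not state which boosting variant was used, so AdaBoost is named here only as a well-known instance of the idea.

```python
import math

def adaboost_round(weights, predictions, labels):
    """One boosting round: return the model's vote weight and the updated instance weights.

    weights:     current instance weights, assumed to sum to 1
    predictions: the weak learner's predicted labels
    labels:      the true labels
    """
    err = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0) and division by zero
    alpha = 0.5 * math.log((1 - err) / err)  # larger vote for more accurate models
    updated = [
        w * math.exp(alpha if p != y else -alpha)  # raise misclassified, lower correct
        for w, p, y in zip(weights, predictions, labels)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]
```

After the update, the misclassified instances carry half of the total weight, so the next weak learner is pushed to concentrate on them.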


2.3.8 Stacking 


Stacking (or stacked generalization) combines multiple classifiers
generated by different learning algorithms on a single data set. This
algorithm works by first generating a set of base-classifiers, and
then training a meta-level classifier to combine the outputs of the
base-classifiers [30].


2.3.9 Random Forests 

Random Forests (or random decision forest) is designed to 
overcome the “overfitting” problem of decision trees. Random 
Forests constructs a multitude of decision trees in the training 
phase, and uses majority voting for classification [31,33,34]. 


2.4 Feature Sets 


Three NLP feature sets were identified as theoretically relevant to
the objective: unigrams, linguistic characteristic scores, and
“elaborative” (new) n-grams.


2.4.1 Unigrams 

Unigrams are the individual content words present in the idea units. 
The value of a unigram feature was the frequency of that unigram 
in the corpus. Some of the most common words appearing in the 
idea units are elephant (>1000), story (575), zoo (429), handicap 
(361), government (323), believe (158), and think (147). 


2.4.2 Linguistic Characteristics 

The second set of features considered were the linguistic
characteristic scores. Ideas that reflect events from the text are
likely to be more concrete, whereas those that are interpretive and
reflect themes (e.g., freedom, loss of innocence) are more abstract
[35]. Thus, both concreteness and imagability were included as
indices. Related to the greater sophistication in interpretive
language, we also included word familiarity and age of acquisition.
These linguistic characteristics were derived from merging norms
of human ratings from three sources [36,37,38]. Details of the
merging are provided in appendix 2 of the MRC Psycholinguistic
Database User Manual [39]. The characteristics, as defined by
McNamara and colleagues [40], appear in Table 2.


Table 2. Descriptions of relevant linguistic characteristics
(From McNamara, Graesser, McCarthy, and Cai, 2014)

Linguistic Characteristic   Description
Concreteness                The degree to which a word is non-abstract
Imagability                 How easy it is to construct an image of a word
                            in one’s mind
Familiarity                 How familiar a word is to an adult
Age of Acquisition          The age at which a word first appears in a
                            child’s vocabulary


2.4.3 Elaborative n-grams 

The third feature set was the frequency of “elaborative” n-grams. 
These were words (unigrams), two consecutive words (bigrams) or 
three consecutive words (trigrams) that were new in the sense that 
they appeared in the idea units, but not in the original story. In 
addition, frequency of occurrence of a set of cue words or phrases 
that indicate an interpretive idea unit was included in this feature 
set. 
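
A minimal sketch of this feature set follows. It counts the n-grams of an idea unit that never occur in the story and the occurrences of cue phrases. Whitespace tokenization, lowercasing, and the cue phrase in the usage note are simplifying assumptions; the paper's own extraction was done with R packages and restricts unigrams to content words.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def elaborative_counts(idea_unit, story, cue_phrases, max_n=3):
    """Count n-grams (n = 1..max_n) in the idea unit that are absent from the story,
    plus the number of cue-phrase occurrences, stored under the key "cue"."""
    unit_tokens = idea_unit.lower().split()
    story_tokens = story.lower().split()
    counts = {}
    for n in range(1, max_n + 1):
        seen_in_story = set(ngrams(story_tokens, n))
        counts[n] = sum(1 for g in ngrams(unit_tokens, n) if g not in seen_in_story)
    text = " ".join(unit_tokens)
    counts["cue"] = sum(text.count(cue.lower()) for cue in cue_phrases)
    return counts
```

For instance, with a hypothetical cue phrase such as "the theme is", an idea unit that opens with that phrase contributes one cue occurrence in addition to its new unigrams, bigrams, and trigrams.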


We used a set of R packages to implement the classification
algorithms and to extract the feature sets. The R packages used
for classification include ‘RTextTools’, ‘e1071’, ‘randomForest’,
‘nnet’, ‘MASS’, and ‘caret’. The packages used for text mining
and for extracting n-grams from the idea units and essays were ‘tm’,
‘tau’, ‘openNLP’, ‘qdap’, and ‘quanteda’.


3. EXPERIMENTS & RESULTS 


3.1 Feature Selection 

The three NLP feature categories (frequency of unigrams, linguistic 
features of words, and number of “elaborative” n-grams and cue 
words) were tested in seven experiments. 


The total number of unigrams extracted from the idea units was
4,406, resulting in a frequency matrix of 4,111 × 4,406 dimensions.
The number of features thus exceeded the number of idea units in
the corpus. As a means of reducing the dimensions in the data set,
highly correlated unigrams (Pearson r > .65) were removed.
However, this exercise did not significantly reduce the dimensions.
It was noted that many of the unigrams did not appear frequently.
Several frequency thresholds were tested to determine a frequency
that would reduce dimensions, but not overly affect the accuracy of
the model. It was determined that a frequency threshold of 10 was
sufficient. Including only those unigrams that appeared in the
corpus at least 10 times reduced the feature dimensions from 4,406
to 609.
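
The frequency threshold can be sketched as a two-step procedure: count each unigram over the whole corpus, keep those at or above the threshold, and build the frequency matrix over the surviving vocabulary. The tokenized input format and function names are assumptions for the illustration.

```python
from collections import Counter

def frequent_unigrams(idea_units, min_count=10):
    """Unigrams that occur at least min_count times across all idea units."""
    counts = Counter(tok for unit in idea_units for tok in unit)
    return sorted(word for word, c in counts.items() if c >= min_count)

def frequency_matrix(idea_units, vocabulary):
    """One row per idea unit, one column per retained unigram."""
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for unit in idea_units:
        row = [0] * len(vocabulary)
        for tok in unit:
            j = index.get(tok)
            if j is not None:  # tokens below the threshold are dropped
                row[j] += 1
        matrix.append(row)
    return matrix
```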


For the second set of features we considered an initial set of 56
linguistic characteristics. The linguistic features included
concreteness, familiarity, imagability, and age of acquisition scores
for all the words, content words, function words, and all words with
or without keywords. These features were extracted using two NLP
tools: the Tool for the Automatic Analysis of Lexical
Sophistication [41] and the Tool for Automatic Analysis of Text
Cohesion [42]. Highly correlated (Pearson r > .85) features were
removed, yielding 18 linguistic features for the classification tests.
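
One way to carry out such a correlation filter is sketched below: features are visited in order and a feature is kept only if its correlation with every feature already kept stays at or below the threshold. The greedy order and the use of the absolute value of r are assumptions; the paper states only the threshold.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def drop_correlated(columns, threshold=0.85):
    """columns: dict feature name -> list of values. Returns the names that are kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```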


For the “elaborative” n-grams feature set (unigrams, bigrams, and
trigrams present in the idea units but not in the original story, plus
cue words), the bigrams and trigrams were found to be highly
correlated (Pearson r > .85). Consequently, of the two, only trigrams
were retained. In total, three features (elaborative unigrams,
elaborative trigrams, and cue words) were used in the elaborative
n-gram feature set for classification.


This final feature set was used to classify each idea unit as 
paraphrase, text-based inference, or interpretive inference using 
ML classification algorithms. Similar approaches have been used 
to classify other kinds of texts [43]. 


3.2 Idea Unit Classification 


After experimenting with a large number of classification 
algorithms, we selected four machine learning classification 
algorithms (Trees, Support Vector Machine [SVM], Neural 
Networks, Maximum Entropy [MaxEnt]), as well as three ensemble 
approaches (Bagging, Boosting, Random Forests) to classify the 
idea units. Multiclass classification algorithms and 10-fold cross- 
validation were used in seven experiments to test the feature sets 
(609 unigrams, 18 linguistic features, and 3 elaborative n-grams) 
individually and in combination. A summary of the classification
accuracy for all the algorithms is presented in Table 3.
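
The 10-fold cross-validation used in these experiments can be sketched as below: the idea units are shuffled once, split into k folds, and each fold serves once as the test set while the remaining folds are used for training. The `fit` and `predict` callables stand in for any of the classifiers and are assumptions of the sketch.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def cross_validated_accuracy(X, Y, fit, predict, k=10, seed=0):
    """Mean accuracy over k folds; assumes len(X) >= k so no fold is empty."""
    scores = []
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [Y[i] for i in train])
        correct = sum(predict(model, X[i]) == Y[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / k
```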


The bold entries in Table 3 indicate the maximum accuracy for each 
of the features. Random Forests achieved the highest accuracy for 
all experiments except when using elaborative n-grams as features. 
The Boosting algorithm classifier achieved the maximum accuracy 
in this case. 


The italicized entries in Table 3 indicate the maximum accuracy 
achieved by a classification algorithm. Generally, the classification 
algorithms achieved high accuracy when a combination of all 
features was used. The accuracy for the algorithms varied between 
0.77 and 0.94 when considering a combination of all the features, 
except for the Trees algorithm where the accuracy was quite low, 
0.61. In fact, the accuracy for the Trees algorithm was low in all 
cases irrespective of the features considered. 


F-scores for the three types of idea units produced by participants 
(interpretive, paraphrase, text-based) are summarized in Tables 4 
and 5 for single classifiers and ensemble of classifiers, respectively. 
The bold numbers indicate the highest F-score for each type of idea 
unit. For the single classifiers, SVM achieved the highest F-score 
for paraphrases (F = 0.81) and for interpretive inferences (F = 0.73). 
MaxEnt obtained the highest F-score for single classifiers for text- 
based inferences (F = 0.42). For ensemble classifiers, Random 
Forests again performed the best, with the highest F-scores for 
paraphrases (F = 0.80) and interpretive inferences (F = 0.70). The 
Bagging algorithm achieved the highest F-score (F = 0.30) for text-
based inferences in the ensemble category. The F-scores for identifying
text-based inferences were relatively low, suggesting a machine 
learning approach may be better suited for identifying paraphrases 
and interpretations. The NAs in Table 4 indicate that the algorithm 
did not classify any idea unit as text-based. 
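
A per-class F-score of this kind can be computed as sketched below, where the F-score is the harmonic mean of precision and recall. Returning `None` when a class is never predicted mirrors the NA entries described above; treating that case as NA rather than zero is a convention assumed for the sketch.

```python
def f_scores(actual, predicted):
    """Per-class F-score; None (NA) for a class the classifier never predicts."""
    result = {}
    for label in sorted(set(actual)):
        true_pos = sum(a == label and p == label for a, p in zip(actual, predicted))
        n_predicted = sum(p == label for p in predicted)
        n_actual = sum(a == label for a in actual)
        if n_predicted == 0:
            result[label] = None  # precision is undefined
            continue
        precision = true_pos / n_predicted
        recall = true_pos / n_actual
        if true_pos == 0:
            result[label] = 0.0
        else:
            result[label] = 2 * precision * recall / (precision + recall)
    return result
```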


Table 3. Accuracy for different classification algorithms with different feature combinations
¹Unigrams (n=609); ²Linguistic Features (n=18); ³Elaborative n-grams (n=3; unigrams, trigrams, cue words)

                                    Classification Algorithm
Feature           SVM    Trees  MaxEnt  NeuralNets  Boosting  Bagging  Random Forests
UNI¹              0.75   0.58   0.81    0.77        0.73      0.75     0.86
LIN²              0.80   0.56   0.55    0.58        0.77      0.92     0.94
ENC³              0.64   0.60   0.58    0.62        0.79      0.63     0.61
UNI + LIN         0.77   0.58   0.83    0.76        0.74      0.92     0.95
UNI + ENC         0.78   0.61   0.80    0.77        0.77      0.82     0.88
LIN + ENC         0.92   0.59   0.62    0.63        0.79      0.93     0.94
UNI + LIN + ENC   0.81   0.61   0.82    0.77        0.79      0.93     0.94


Table 4. F-Scores for Single classifiers
¹Unigrams (n=609); ²Linguistic Features (n=18); ³Elaborative n-grams (n=3; unigrams, trigrams, cue words);
⁴Interpretive; ⁵Paraphrase; ⁶Text-based Inference

                  SVM                   Trees                 MaxEnt                NeuralNets
Feature           Inter⁴ Para⁵  TB⁶     Inter  Para   TB      Inter  Para   TB      Inter  Para   TB
UNI¹              0.71   0.80   0.28    0.44   0.71   NA      0.65   0.76   0.36    0.63   0.76   0.13
LIN²              0.45   0.73   0.13    0.27   0.70   NA      0.52   0.66   0.30    0.46   0.73   NA
ENC³              0.46   0.73   0.03    0.52   0.73   NA      0.50   0.72   NA      0.57   0.74   NA
UNI + LIN         0.70   0.81   0.35    0.49   0.72   NA      0.66   0.77   0.41    0.64   0.79   0.08
UNI + ENC         0.73   0.81   0.34    0.55   0.74   NA      0.69   0.78   0.38    0.62   0.73   0.18
LIN + ENC         0.48   0.73   0.11    0.50   0.73   NA      0.58   0.74   0.25    0.61   0.77   NA
UNI + LIN + ENC   0.72   0.81   0.36    0.55   0.74   0.30    0.70   0.79   0.42    0.63   0.79   0.06


Table 5. F-Scores for Ensemble classifiers
¹Unigrams (n=609); ²Linguistic Features (n=18); ³Elaborative n-grams (n=3; unigrams, trigrams, cue words);
⁴Interpretive; ⁵Paraphrase; ⁶Text-based Inference

                  Boosting
Feature           Inter⁴ Para⁵  TB⁶
UNI¹              0.65   0.77   0.06
LIN²              0.49   0.70   0.09
ENC³              0.52   0.73   0.06
UNI + LIN         0.57   0.73   0.12
UNI + ENC         0.62   0.76   0.07
LIN + ENC         0.55   0.73   0.23
UNI + LIN + ENC   0.61   0.76   0.18


4. CONCLUSIONS 


This study demonstrates that a classification approach using 
unigrams, linguistic features, and “elaborative” n-grams can be 
used to accurately predict human ratings of idea unit classification 
for essays about literary works. 


This study indicated that ensemble classification algorithms were, 
generally, more accurate than single classifiers. Random Forests, 
which is an ensemble of decision trees and uses a bagging 
approach, was the most accurate classifier and had the highest F- 
scores for most types of idea units. In contrast, the single classifier 
Trees showed relatively low accuracy. This finding is consistent 
with previous work that suggests Trees “overfits” to training data 
and, as a result, performs poorly on test data [44]. 


Interestingly, the performance of the single classifier SVM was
comparable to that of the ensemble classifiers. This classifier may
have been highly accurate because our data had a large number of
features under consideration. SVM is designed to support high-
dimensional spaces and data that may not be linearly separable.


This study provides a model for how machine learning and NLP 
can be used to assess literary text comprehension. In addition to 
being economical for researchers recruiting large samples and 
collecting large amounts of essay data, the approach can also be 
implemented in other automated writing evaluators (AWEs) to 
provide domain-specific assessment and feedback. 


The presence of interpretive inferences suggests that a reader has 
successfully moved beyond the literal to engage in domain- 
appropriate interpretations. However, interpretive inferences are 
not necessarily indicative of higher quality literary text 
comprehension. Literary comprehension requires not only 
generating interpretations, but also justifying those interpretations 
with evidence from the text as well as appeals to cultural and 
literary norms [1,45]. Hence, good essays are likely to have a 
relatively even distribution of the various types of ideas (e.g., both 
inferences and interpretations). Our future plans include assessing
the essays holistically and developing algorithms to predict those
scores. Our ultimate objective is to better understand the relations
between idea unit types and essay quality as well as to further the 
development of automated assessment of literary comprehension.

0.76 0.17 0.68 0.79 0.27
0.72 0.26 0.51 0.74 0.21
0.77 0.27 0.70 0.80 0.28
0.75 0.25 0.58 0.77 0.21

ABSTRACT


As higher education institutions develop fully online course 
programs to provide better access for the non-traditional learner, 
there is increasing interest in identifying students who may be at 
risk of attrition and poor performance in these online course 
programs. In our study, we investigate the effectiveness of an 
online orientation course in improving student retention in an 
online college program. Using student activity data from the 
orientation course, Engage, we make use of machine learning 
methods to develop prediction models of whether students will 
be retained and continue to register for program-specific courses 
in the eVersity program. We then discuss the implications of our 
findings on improvements that may be made to the existing 
orientation course to improve student retention in the program. 


Keywords 


Prediction modeling, online orientation course, student retention 


1. INTRODUCTION 


With the widespread development of online learning programs 
in institutes of higher learning, access to a college education has 
improved by a considerable amount. Despite increased 
enrollment rates within these online degree programs, however, 
student attrition or dropout rates also tend to be correspondingly 
higher than in traditional face-to-face degree programs [4, 21]. 
Dropout can occur early for many students in online programs; 
some students drop out even before they register for their first 
course [24]. As such, it has become increasingly important for 
facilitators and administrators to identify factors that may 
influence attrition and retention in these online course offerings, 
and implement targeted interventions to increase retention. 


Some of these targeted interventions involve the use of machine 
learning to provide timely information on student progress 
within a course to teachers and facilitators [1, 12, 17]. These 
interventions allow them to identify at-risk students earlier on
within an online course, and take steps to encourage student
retention. Another type of intervention involves the 
development of online orientation courses taken before the 
beginning of the program. These courses aim to provide students 
with the support and resources they may need during their 
progression through the program [3, 8]. A combination of the 
above interventions may also be implemented where machine 
learning models are developed to identify patterns in student 
behavior within online orientation courses themselves, which 
could help inform teachers and facilitators of students at risk of 
dropout even earlier on within an online program. 


In this study, we use machine learning to investigate student 
behavior within a required online orientation course, Engage, 
for students registered in an online university, eVersity. eVersity 
is a completely online course program established and 
developed by the University of Arkansas System (UAS). Using 
student data in this online orientation course, we developed a 
model that allows us to predict the likelihood of their continued 
participation in the online college program, through their 
registration in future program-specific courses. 


2. LITERATURE REVIEW 


There has been extensive research in recent years to identify 
factors that lead to low student retention rates, particularly 
within the context of online learning programs [9, 16, 25]. 
Attrition and retention can be defined in several ways. Since this 
paper is focused on an online course program that emphasizes 
learning at students’ own pace and preferred time(s), we make 
use of the definition proposed by Pascarella and Terenzini [22] 
(p.374), where retention is defined as progressive re-enrollment, 
whether continuous from one term to the next, or temporarily 
interrupted and then resumed, until completion with a degree. 


Several researchers have found that student dropout rates in 
online courses are due to a variety of circumstances, including 
personal, job, or technology-related reasons [25], and are 
typically independent of demographic factors such as gender and 
race [2, 11, 25]. Park et al. [20] also found that organizational 
support and course relevance are better predictors than 
demographic variables, and significantly predict student 
persistence as well as student dropout in online course 
programs. Both O’Brien & Renner [18] and Jung et al. [14] 
replicated these findings and found that online courses that 
increase opportunities for student interaction, such as group 
work, tend to improve student engagement, thereby reducing
student dropout. 


A popular intervention that has been implemented to improve 
student retention, based on these findings, is the development of 
orientation courses that seek to provide new students with 
organizational support, guidance, and resources that they may 
need to support their online learning. Studies have found that 
such online orientation courses can be effective at improving 
retention and the overall student learning experience [5, 8, 13]. 


Other interventions have focused on providing information to 
instructors, academic advisors, and facilitators on which 
students are at risk, so that these students can be contacted and
better supported [1, 12, 17]. Increasingly, these types of 
interventions have been driven using automated models that can 
identify students who are at risk of dropping out or performing 
poorly, so that instructors and facilitators can focus intervention 
efforts on the students who are most likely to benefit from an
intervention. The use of data mining techniques has enabled 
course facilitators to identify at-risk students early on within a 
course. For instance, Dekker and colleagues [7] made use of 
data mining techniques to identify students at risk of dropping 
out from an electrical engineering program, after the first 
semester of their studies, or even before they enter the program. 
In another study, Lauria et al. [15] developed models to predict 
student performance based on course management system data 
as well as student academic records. 


Such models have then been used by higher education 
institutions to provide support through early interventions to at- 
risk students. This type of intervention has been developed and 
implemented by various universities and companies, including 
Purdue University, Marist College, Civitas Learning, and 
ZogoTech [1, 10, 12, 17]. Arnold & Pistilli’s work [1], for
example, examines the development and implementation of 
Course Signals at Purdue University. Course Signals makes use 
of learning analytics to help course faculty provide accurate 
real-time feedback to their students about whether they are on 
track to succeed in their current course. Analyses of student 
performance showed that students who participated in at least 
one Course Signals course achieved better grades and 
experienced higher retention rates than their peers who did not
participate in any Course Signals courses. Similarly, Fritz [10] 
makes use of learning analytics to develop an intervention called 
“Check My Activities”, where students are given the
opportunity to compare their online course activity against an 
anonymous summary of their peers in the course, thus providing 
early system feedback directly to the students so that they are 
more aware of their own levels of engagement within a course. 


3. EVERSITY — ONLINE LEARNING 


eVersity is a fully online institution of the University of
Arkansas System, which comprises institutions of higher
education across the state. The mission of eVersity is to provide
online education specifically for adult learners; in particular, at- 
risk learners who may have previously dropped out of college 
and may require additional support to be successful 
academically. Currently the eVersity student population is 65% 
female, 69% white, 27% black or African American, and the 
average age is 36. Each academic term runs for a short 6 weeks 
to allow enrolled students maximum flexibility in fitting the 
online courses within their schedules. 


To better serve students, eVersity offers a free credit-bearing
orientation course, Engage. This course fulfills two functions, 
both related to the goal of improving student retention: to 
introduce students to the tools and information they need to be 
successful in an online learning environment, and for the 
institution to get to know its students. Engage also aims to 
provide resources and guidance to new students as they continue 
on to register in program-specific online courses within 
eVersity. Upon enrollment in the eVersity program during any 
of the seven terms throughout the year, students are 
automatically registered in Engage. Within Engage, information 
is organized into 6 Steps: Welcome, Getting to Know You, 
Funding My Future, Supporting My Academic Success, 
Developing My Learning Plan, and My Financial Plan. Students 
are free to explore the six course sections at their own pace 
within the six-week academic term. 


To ensure student participation within each section of the 
course, students are required to complete knowledge checks and 
assessments at the end of each Step before they can access the 
next Step. These assessments and checkpoints help students to 
process the information provided within each Step, and provide 
students with practice opportunities to complete work in online 
formats that will be commonly used within later program- 
specific courses, such as uploading assignments and journal 
entries, and taking online quizzes. Completion of the Engage 
course is required for students who wish to continue on to 
register for program-specific courses on eVersity. 


4. METHODS 


4.1 Orientation Course Data 

The dataset used for analysis was obtained from the Blackboard 
online learning system, and included student data from the first 
rollouts of the Engage course in the October 2015 and January 
2016 terms. As discussed above, each term spans approximately 
six weeks. The data set provided resource access information per 
student, including date accessed and page accessed, as well as 
actions performed while on these pages. Resources accessed and 
respective actions include: 


1. Journals: add journal entry, view draft, edit journal 
entry 


2. Assessments: launch assessment, review attempt, save 
attempt, submit assessment 


3. Assignments: upload assignment 
4. Discussion Boards: discussion entry, discussion reply 


5. Messages: view messages, email instructor, email 
select students 


6. Gradebook: check grade 


We also obtained demographic data consisting of each student’s 
age, gender, race, whether or not their parents attended college, 
and whether or not they registered for a class in any of the three 
academic terms immediately following the completion of the 
Engage orientation course. Of the cohort, a total of 151 students 
registered for courses after completing the Engage orientation 
course. 


We then built a prediction model to identify which student 
features are more strongly associated with future registration in 
for-credit courses on eVersity. 
4.2 Data Cleaning and Feature Generation
The data set obtained from eVersity included resource access 
data, and demographic and enrollment data. It represented 
97,298 page accesses and actions across 325 students. 


During their use of Engage, these students interacted with 
course content (i.e., video lectures), journals, assessments in the 
form of online quizzes, assignments, discussion boards, 
messages, and the gradebook. Each transaction within the access 
log contained a user ID, date stamp (with no time data 
available), page accessed, and, where relevant, the action 
performed. 


The features investigated in this study included: 


1. Total counts — total number of times student accessed 
each resource regardless of what action they 
performed (e.g., total count for journal access is the 
sum of the total count of journal access to write a new 
post and the total count of journal access to edit an 
existing post) 


2. Days till first access — number of days since start of 
interaction until a student accessed any of the 
resources and performed each of their specific actions 


3. Days between — average number of days between 
specific resources accesses and actions performed 
(e.g., average number of days between two journal 
views, average number of days between creation of a 
journal post and editing or submitting the same 
journal post) 


4. Inactivity — average number of days inactive (i.e.,
number of days between any two transactions) 


5. Descriptive statistics — average, standard deviation, 
minimum, and maximum values per resource access 
across days the student interacted with Engage 
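As a rough illustration of how features of this kind might be derived from a date-stamped access log, the sketch below computes total counts, days till first access, average days between accesses, and average inactivity for a single student. The record layout (date, resource, action), the function name, the feature names, and the toy data are assumptions made for illustration; they are not the actual Blackboard export format or the pipeline used in this study.

```python
from collections import defaultdict
from datetime import date


def student_features(log, start):
    """Compute simple features from (day, resource, action) rows.

    `log` holds the transactions of ONE student and `start` is the date
    from which that student's interaction is counted. Both the layout and
    the feature names are hypothetical.
    """
    days_by_resource = defaultdict(list)
    for day, resource, _action in log:
        days_by_resource[resource].append(day)

    features = {}
    for resource, days in days_by_resource.items():
        days.sort()
        # Total count: every access to the resource, whatever the action.
        features[f"{resource}_total"] = len(days)
        # Days till first access, counted from the start date.
        features[f"{resource}_days_till_first"] = (days[0] - start).days
        # Days between: mean gap between consecutive accesses.
        gaps = [(b - a).days for a, b in zip(days, days[1:])]
        features[f"{resource}_days_between"] = sum(gaps) / len(gaps) if gaps else None

    # Inactivity: mean gap between any two consecutive transactions.
    all_days = sorted(day for day, _r, _a in log)
    all_gaps = [(b - a).days for a, b in zip(all_days, all_days[1:])]
    features["inactivity"] = sum(all_gaps) / len(all_gaps) if all_gaps else None
    return features


toy_log = [
    (date(2015, 10, 5), "gradebook", "check grade"),
    (date(2015, 10, 9), "gradebook", "check grade"),
    (date(2015, 10, 6), "discussion", "discussion reply"),
]
feats = student_features(toy_log, start=date(2015, 10, 1))
```

Because the log carries only a date stamp, all gaps are whole days; a resource accessed once has no "days between" value, which the sketch records as `None` rather than zero.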


In calculating these features, we excluded behaviors that were 
required to complete the Engage course. Completing the Engage 
course was required in order for a student to continue on to 
register for a program-specific course, so any feature required to 
complete Engage would be tautologically connected to
registering for a program-specific course. Specifically, we 
excluded student activity around completing assessments, 
uploading assignments, and adding journal entries. We thus 
removed these features in order to identify other student actions 
that may be related to future student registration in an eVersity 
course, but are not explicitly required for the student to register 
in an eVersity course. 


4.3 Prediction Modeling 


Prediction models of student activity were created using 
RapidMiner 5.3 in order to determine which combination best 
predicts whether a student will register in a program-specific 
course after completing Engage. We attempted to predict this 
variable using J-Rip classification and J-48 decision trees, with 
10-fold student-level cross-validation. Cross-validation splits 
the data points into N equal-size groups. In the case of the 
current study, data points were split into 10 groups. It then trains 
on all groups but one, and tests on the last group, and does so 
for each possible combination. 
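The splitting step can be sketched in a few lines. The study ran cross-validation inside RapidMiner; the pure-Python function below is only an illustration of a student-level split, in which every student (and hence all of that student's data) lands in exactly one test fold. The function name, the fixed seed, and the round-robin fold assignment are assumptions.

```python
import random


def student_level_folds(student_ids, k=10, seed=0):
    """Yield (train_ids, test_ids) pairs for k-fold cross-validation.

    Students are shuffled once and dealt round-robin into k folds, so fold
    sizes differ by at most one and no student appears in both sets.
    """
    ids = sorted(set(student_ids))
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i, test_ids in enumerate(folds):
        train_ids = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train_ids, test_ids


# 325 students, as in the data set described above, split into 10 folds.
splits = list(student_level_folds(range(325), k=10))
```

Each student is used for testing exactly once, so pooling the ten test folds gives one held-out prediction per student.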


J-48 decision trees, the RapidMiner Weka Expansion Pack
implementation of the C4.5 algorithm, can handle both
numerical and categorical predictor variables. The algorithm
repeatedly looks for the feature which best splits the data in 
terms of predictive power for each variable. It later prunes out 
branches that turn out to have low predictive power. Different 
branches can have different sets of features. In cases where 
numerical predictors are used, the algorithm tries to find the 
optimal split. J-Rip is the RapidMiner Weka Expansion Pack 
implementation of the Repeated Incremental Pruning to Produce 
Error Reduction (RIPPER) [6], a propositional rule learner. J- 
Rip produces a set of rules, through stages of growing and 
pruning, that account for all classes and minimize error.


Model variable selection was conducted using forward selection, 
where the feature that most increases fit is added to the current 
model, until no additional features improve the model. The 
resultant models’ performance was assessed using Cohen’s 
Kappa and AUC ROC. Kappa indicates the degree to which the 
detector is better than chance at identifying a modeled construct. 
0 means that the model is no better than chance, and 1 means
perfect performance. AUC ROC is the area under the ROC
curve, and is also the probability that given 1 instance of
‘registered’ and 1 instance of ‘not registered’, the model is able 
to tell which instance is which. It is computed using the A’ 
implementation to control for artificially high AUC ROC 
estimates due to having multiple data points with the same 
confidence. An AUC ROC value of 0.5 indicates chance level of 
performance, while a value of 1 means perfect accuracy. 
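To make the two metrics concrete, the following minimal sketch computes Cohen's Kappa from thresholded predictions and a pairwise A' estimate from model confidences. It is not the implementation used in the study; in particular, counting a tie between a 'registered' and a 'not registered' instance as half a correct ordering is our assumption about how such ties are handled, and the toy labels and confidences are invented.

```python
def cohens_kappa(actual, predicted):
    """Cohen's Kappa for two binary label lists (1 = registered)."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    # Agreement expected by chance, from the marginal label rates.
    p_act = sum(actual) / n
    p_pred = sum(predicted) / n
    expected = p_act * p_pred + (1 - p_act) * (1 - p_pred)
    # Undefined (division by zero) if chance agreement is already perfect.
    return (observed - expected) / (1 - expected)


def a_prime(actual, confidence):
    """Probability that a positive outranks a negative; ties count 0.5."""
    pos = [c for a, c in zip(actual, confidence) if a == 1]
    neg = [c for a, c in zip(actual, confidence) if a == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


actual = [1, 1, 1, 0, 0, 0]
confidence = [0.9, 0.8, 0.4, 0.4, 0.2, 0.1]
predicted = [1 if c >= 0.5 else 0 for c in confidence]
kappa = cohens_kappa(actual, predicted)   # 2/3 on this toy data
auc = a_prime(actual, confidence)         # 8.5/9 on this toy data
```

In the toy data one 'registered' and one 'not registered' instance share the confidence 0.4; that pair contributes 0.5 rather than 1, which is why the estimate is 8.5/9 rather than 1.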


4.4 Demographic Cross-Validation

Some prior research has shown that prediction models may have 
different levels of accuracy for different subgroups within the 
data set [19]. To determine whether this was a concern, we 
evaluated the performance of the models across different 
demographic groups in our data set. After the models had been 
developed and cross-validated, we took the models’ predictions
on the test sets and evaluated their performance on subsets of
the data based on the different demographic groups in our 
sample. In particular, we compared the performance of the 
model by gender (male versus female), race (white versus 
African-American) and parents’ college education (parents 
attended college versus parents did not attend college). In 
addition to the majority of white and African-American students 
analyzed, 7 students were Native American. This number of 
students was insufficient to allow for a valid calculation. We 
then calculated performance metrics for each of these 
demographic groups. 
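The subgroup evaluation amounts to re-scoring the pooled test-fold predictions within each demographic group. The sketch below illustrates that bookkeeping under stated assumptions: the row layout and key names are hypothetical, plain accuracy stands in for the Kappa and AUC ROC metrics actually reported, and the `min_size` cut-off is an invented parameter mirroring the decision to skip groups too small for a valid calculation.

```python
from collections import defaultdict


def accuracy(actual, predicted):
    """Share of predictions that match the true label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def metric_by_group(rows, group_key, metric, min_size=1):
    """Apply `metric` to held-out predictions within each subgroup.

    `rows` are dicts with hypothetical keys 'actual', 'predicted' and the
    demographic field named by `group_key`. Groups with fewer than
    `min_size` students are skipped.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(row)
    results = {}
    for group, members in grouped.items():
        if len(members) < min_size:
            continue
        results[group] = metric([m["actual"] for m in members],
                                [m["predicted"] for m in members])
    return results


rows = [
    {"group": "A", "actual": 1, "predicted": 1},
    {"group": "A", "actual": 0, "predicted": 0},
    {"group": "A", "actual": 1, "predicted": 0},
    {"group": "B", "actual": 0, "predicted": 0},
    {"group": "B", "actual": 1, "predicted": 1},
    {"group": "C", "actual": 1, "predicted": 1},
]
by_group = metric_by_group(rows, "group", accuracy, min_size=2)
```

Note that the model is not retrained per group; only the evaluation is restricted, so differences between groups reflect how well a single model generalizes to each of them.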


5. RESULTS 


5.1 Model and Performance 

Prediction models created using the W-J48 and W-JRip 
classification algorithms resulted in high kappa and AUC 
values. Both algorithms used resulted in comparably high 
performance. As such, we will discuss both of these models 
below. The full set of models run and their respective 
performance values can be found in Table 1. 


Table 1. Cross-validated performance of models of student 
enrollment with different classification algorithms 
5.1.1 J-48 Model


With the J-48 model, a total of four features were selected in 
some folds of the cross-validation, but not all of them were 
selected in the final model fit on all data: 


• number of days before grades were first checked by
the student,


• minimum number of times grades were checked by the
student,


• total number of views of online messages within the
course platform, and


• total number of views of the Discussion Board Reply
page.

The four features initially selected in some of the cross- 
validation folds indicate that students who checked their course 
grades earlier and more frequently, responded more to 
discussion board posts, and viewed in-course messages more 
frequently were more likely to register in a program-specific 
eVersity course after completing Engage. 


The final decision tree generated using this algorithm contained 
3 leaf nodes and 2 decision nodes. The decision tree generated 
by the prediction model is shown in Figure 1. 


As can be seen in the figure, only 2 of the selected features had 
strong enough associations with future course registration to be 
included in the pruned decision tree built on all data: Number of 
views of the Discussion Board Reply page, and the number of 
days till the first time the student checks their course grades. 


The decision tree generated with the J-48 model, shown in 
Figure 1, provides an indication of how each student’s future 
course registration is predicted, and the confidence level 
assessed for each student’s prediction. 


[Decision tree diagram; recoverable node labels: “Total discussion board replies (views)”; “Future course registration prediction: 1, Confidence: 86.4%”; “Future course registration prediction: 0, Confidence: 99.1%”]


Figure 1: Visual representation of the decision tree generated
by the J-48 algorithm


The decision tree in Figure 1 shows that a student who has made
fewer attempts to respond in the discussion board is less likely 
to register in a program-specific course in the future, with a 
confidence of 98.8%. Similarly, we can see that students who 
checked their course grades earlier on during the term were 
more likely to register for a program-specific course afterwards, 
with a confidence of 86.4%. In contrast, students who only
viewed their course grades much later after the start of the
orientation course or not at all had a 99.1% confidence of not
registering for another eVersity course in the future. 


5.1.2 J-Rip Model 


In the J-Rip model, on the other hand, only one feature was 
selected: the total number of views of the Discussion Board 
Reply page. Based on the J-Rip model classification rules, 
students who viewed the Discussion Board Reply page more 
often (>= 3 times) within the duration of the orientation course 
had a higher probability of registering in an eVersity course 
afterwards, with a confidence of 82.4%. In contrast, students 
who viewed the Discussion Board Reply page 3 times or fewer 
during the course had a lower likelihood of registering in 
another course later on, with a confidence of 98.8%. 


The J-48 and J-Rip models obtained comparable performance 
metrics, with the J-48 model having a marginally higher AUC 
value than the J-Rip model, and the J-Rip model having a 
slightly higher Kappa value than its J-48 counterpart. This 
implies that the J-Rip model had a higher proportion of correct 
predictions when thresholded, but because only one
classification rule was selected, there were only 2 confidence 
values that were associated with these predictions, hence 
resulting in a lower AUC value. In contrast, more features were 
selected in the J-48 model (and more differentiations were 
made), which could explain the slightly higher AUC value for 
that model than the J-Rip model. 


5.2 Performance for Demographic Groups 
We then tested both cross-validated prediction models across
three sets of demographic comparisons: gender (male vs.
female), race (white vs. African-American) and whether the
student’s parents attended college or not. For the J-48 model, we 
found that it performed relatively well across all the 
demographic groups tested, and close to the performance values 
obtained in the overall model. The model performances of the 
various demographic groups are listed in Table 2 below. Our J- 
48 model performed at similar levels for most of the 
demographic groups that were tested. However, it performed 
marginally worse for African-American students 
(Kappa = 0.728, AUC = 0.905). When compared to the model’s 
performance on the full data set (Kappa = 0.806, AUC = 0.925), 
its performance was still quite good in absolute terms even for 
this group. 


Table 2. Performance of J-48 models of student enrollment 
for different demographic groups 


Demographic group                 Kappa    AUC
Parents attended college          0.763    0.908
Parents did not attend college    0.829    0.933


Similarly, we found that our J-Rip model performed at 
comparable levels of performance across different demographics 
when compared to performance on the full data set. As with the 
J-48 model, the J-Rip model was least accurate for African-American
students, but still obtained good predictions, with
Kappa = 0.748, AUC = 0.907. 


Table 3. Performance of J-Rip models of student enrollment 
for different demographic groups 


Demographic group                 Kappa    AUC
Parents attended college          0.774    0.896
Parents did not attend college    0.854    0.921


These findings suggest that the models obtained here are reliable 
across demographic groups, indicating that they can be used 
without concern regarding equity in their predictions. 




6. DISCUSSION 


To increase access to higher education for non-traditional 
students, institutions of higher learning have increasingly 
embraced online learning platforms to provide greater flexibility 
for working adults looking to return to school. Despite easier 
access, student retention and attrition have remained an important
issue that online orientation courses like Engage aim to address. 


In our study on students taking the orientation course Engage, 
we generated a total of 139 features based on student actions 
within the Blackboard course platform and developed models to 
predict future student registration in a program-specific for- 
credit course within the state of Arkansas’s online eVersity. The 
features selected by our model were able to predict with high 
confidence levels the likelihood that students would register in a 
program-specific course after the orientation course. It is also 
notable that both the J-48 and J-Rip models selected the same 
feature (total number of views of Discussion Board Reply page) 
to be positively associated with future course registration. This 
finding echoes and provides support for earlier research 
suggesting that student participation in discussion boards is 
associated with better retention and achievement [18, 23]. 


The features selected in both our models, while not surprising, 
provide important implications that help guide administrators 
and facilitators to design interventions that can better identify at- 
risk students who may not continue on after the orientation 
course. For instance, the feature of discussion board reply views 
appeared to have a very strong association with future 
registration in an eVersity course. According to previous 
research, students’ interactions within a course help improve 
student retention rates [14, 23]. Students who accessed the 
Discussion Board Reply page more often are more likely to be 
interacting with other students and course facilitators. In this 
manner, these students may experience greater engagement in 
the course and the eVersity program, which in turn could 
explain the association between the students’ usage of the 
discussion board and future course registration within eVersity. 


Within the J-48 model, three other features were selected in 
addition to discussion board reply views. The total number of 
views of the Messages page was also included in some models 
during cross-validation, even though it was not included in the
final decision tree built on the entire data set. Like the
Discussion Board Reply page views feature, this feature 
suggests that students who have more interactions with other 
students and course facilitators are more likely to register in 
another eVersity course afterwards. 


Features on the number of days and frequency of the student 
checking of course grades appear to have positive associations 
with future course registration as well. From the decision tree 
generated with the J-48 algorithm, students that only view their 
course grades after a long period of time have a high likelihood 
of not registering for another eVersity course in the future. This 
can be another useful indicator of students who may not be as 
engaged in the eVersity program and their achievement in the 
orientation course, and who have a lower likelihood of 
registering for another eVersity course. 


After developing our models, we tested their reliability across 
different demographic groups. We found that the models
performed comparably well across students of different race and
gender, as well as between groups of students with parents who 
attended or did not attend college. These findings suggest that 
our model is not overtly biased towards or against a specific 
demographic group. 


Based on our models’ performance and the features selected, 
course administrators and facilitators could make further
improvements to Engage to increase student retention in the 
online eVersity program. Since some of the selected features 
involve student interactions, course facilitators could try to 
embed more interactive activities within Engage to encourage 
students to reach out to their peers as well as to the program 
facilitators, and participate more actively in eVersity’s social 
community. Given that discussion board views had high 
predictive power for future course registration within eVersity, 
Engage course facilitators could encourage student participation 
in discussion boards early on in the course, and maintain a 
stronger presence within discussion boards to provide a more 
robust and consistent form of support for students embarking on 
the eVersity program. Nevertheless, it is worth noting that 
student participation in discussion boards may also be a proxy 
for student interest in the course content or their overall goal of 
studying within eVersity. Actions taken by course facilitators to 
encourage student participation in discussion boards may not be 
as helpful in increasing student engagement or interest in the 
course content. Alternatively, it may be more effective for 
course facilitators to tweak the discussion board activities to 
ensure that they are optimally interesting and relevant to the 
learners participating in the orientation course. 


7. CONCLUSION 


In this study, we made use of student interaction data from a 
credit-bearing online orientation course, Engage, in a completely
online university, to build a prediction model of student 
registration in future program-specific courses. The prediction 
models were developed using machine learning algorithms and 
tested across different demographic groups. Two algorithms 
were tested; the performance of both models was high, and the 
models provide indicators that predict future student registration 
in program-specific courses within the online eVersity program. 
These prediction models thus provide eVersity administrators 
and course facilitators with fine-grained information on student 
behavior within the orientation course that could improve 
student retention on eVersity. As such, further improvements 
could be made to the orientation course Engage to accurately
target students at risk of dropping out of the online eVersity 
program, and provide further support to these students at an 
earlier stage in their higher education journey. 
ABSTRACT


Question answering forums in online learning environments 
provide a valuable opportunity to gain insights as to what 
students are asking. Understanding frequently asked ques- 
tions and topics on which questions are asked can help in- 
structors in focusing on specific areas in the course content 
and correct students’ confusions or misconceptions. An un- 
derlying task in inferring frequently asked questions is to 
identify similar questions based on their content. In this 
work, we use hierarchical agglomerative clustering that ex- 
ploits similarities between words and their distributed rep- 
resentations, reflecting both lexical and semantic similarity 
of questions. We empirically evaluate our results on a real-world
labeled dataset to demonstrate the effectiveness of
the method. In addition, we report the results of inferring
frequently asked questions from the discussion forums of an online
learning environment providing lectures to middle school
and high school students.


Keywords 
frequently asked questions, agglomerative clustering, ques- 
tion similarity, community question answering. 


1. INTRODUCTION


Self-paced online learning environments provide valuable learning
resources to a large number of students. A primary mechanism of
interaction between students is the discussion forum. These forums
enable students to ask questions, answer questions, and learn
collaboratively. Question answering forums are discussion forums
where every thread is a question posted by a student, much like
community question answering (CQA) platforms such as
StackOverflow¹ and Quora². Over time, a large number of students
may post similar questions, which could indicate topics susceptible
to confusion or misconceptions, or course content requiring
further explanation. Most question answering forums allow a
student or user to search for similar questions in the archives
using information retrieval techniques. While searching for similar
questions is useful for a student, it provides only a limited view
of frequently asked questions to an instructor. A potential way to
aid manual identification of common or frequently asked questions
in such forums is to employ clustering, so that semantically
related questions are grouped together.


Motivating Example: Table 1 lists sample groups of similar
questions posed by middle and high school students on
Khan Academy³. These groupings, or question clusters, can help
an instructor identify key concerns or confusions among students.
The instructor could address these confusions by providing
additional content on the specific topic. For example, many
students ask questions about the slope of a vertical or
horizontal line. Having a view of question clusters can be
valuable to the instructor and help in refining course content.


Partition-based clustering methods such as k-means, k-medoids, and
k-means++ [9] need prior information about the number of
clusters required. Providing the number of clusters as input can
be very hard for instructors. Hence, in this work we
use hierarchical clustering [9], which does not have this input
requirement. Dendrograms (trees of clusters), which capture the
results of hierarchical clustering, allow instructors
to extract clusters of different granularities without having
to re-run the clustering algorithm. Further, most hierarchical
clustering algorithms provide the flexibility to choose a
distance metric, which we utilize in this work.
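The point that one dendrogram yields clusterings at several granularities can be illustrated with a small pure-Python sketch. It is not the implementation used in this work: average linkage is an assumed choice, one-dimensional toy numbers stand in for questions, and absolute difference stands in for the question distance metric. The function names are hypothetical.

```python
def agglomerate(points, distance):
    """Average-linkage agglomerative clustering.

    Returns the merge history as (merge_distance, cluster_a, cluster_b)
    tuples, where clusters are tuples of point indices. This naive version
    recomputes linkage distances each round and suits small inputs only.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(distance(points[a], points[b]) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


def cut(merges, n_points, threshold):
    """Replay merges up to `threshold` and return the resulting clusters."""
    clusters = {(i,) for i in range(n_points)}
    for d, a, b in merges:
        if d > threshold:
            break  # average-linkage merge distances never decrease
        clusters -= {a, b}
        clusters.add(a + b)
    return sorted(sorted(c) for c in clusters)


points = [0.0, 0.2, 5.0, 5.1, 9.0]
merges = agglomerate(points, lambda x, y: abs(x - y))  # computed once
fine = cut(merges, len(points), threshold=0.5)    # [[0, 1], [2, 3], [4]]
coarse = cut(merges, len(points), threshold=6.0)  # [[0, 1], [2, 3, 4]]
```

The clustering runs once; `cut` only replays the stored merge history, so an instructor could move between finer and coarser groupings without repeating the expensive step.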


Existing work on processing CQA archives identifies or ranks
similar questions given a new question [12]. While the problem
of estimating the relevance of questions to a new
question is related to estimating similarities between
questions to identify clusters, much of the work done to address


¹ www.stackoverflow.com
² www.quora.com
³ www.khanacademy.org
Table 1: Examples of frequently asked questions.


Video Lecture: Graphing a line in slope intercept form
Student Questions:
    What would the line look like if the slope was a zero?
    What is the slope of a horizontal line?
    what about vertical lines? do they have slope?
    Would a vertical line imply a zero slope?


Video Lecture: Proof of Limit sin(x)/x
Student Questions:
    Why not use L’hopital’s rule?
    can you use l’hopital’s rule to prove this limit?
    you can also use l’hopital’s rule to turn sinx/x into cosx/1
    Can you also prove this limit using L’Hopital’s rule?
    ...sin(x)/x == cos(x)/1 and


the former problem uses supervised learning approaches that require
labeled datasets for training and building models.


Our Contributions: We address the problem of inferring frequently asked
questions (FAQ) by harnessing a distance metric that uses both the
similarity of the words in the questions according to a lexical database
(such as WordNet⁴) and a word embedding space representation that
captures the contextual similarity of words. We further provide a
flexible way of cutting the output of the clustering algorithm, the
dendrogram, allowing the end user to identify clusters of questions. A
range specifying the number of points needed to define a cluster is
taken as input. The generated clusters are sorted by the distance
metric, thus enabling instructors to filter and identify relevant
question clusters.


2. RELATED WORK 


In this section we position our work in the context of existing 
literature along two directions: (1) Analyzing textual con- 
tent available in student discussion forums, (2) Processing 
questions in community-based question answering (CQA) 
systems. 


2.1 Student Discussion Forums 
There has been a growing body of research on analyzing the textual
discussion forum data in Massive Open Online Courses (MOOCs).


A precursor to analyzing questions is determining the utterance of
students or classifying the dialog act of the students (such as asking
questions, giving feedback, or agreeing and disagreeing). Ezen-Can et
al. [4] apply the k-medoids clustering algorithm and qualitatively
evaluate the clusters to group dialog acts and topics. In our work, we
analyze posts that are categorized as questions. Topic analysis of MOOC
discussion content using the Structural Topic Model (STM) has been
explored by Reich et al. [15]. While topic labels are useful in
providing a broad overview of the themes that are attracting student
discussions, they do not help the instructor in analyzing finer details
of what students are asking or answering. In a recent work, Thushari et
al. [2] present a 'topic-wise organization' of discussion posts by
using Latent Dirichlet Allocation (LDA) on the discussion data. The
authors present a topic visualization dashboard that


4 https://wordnet.princeton.edu/

would assist MOOC staff in understanding emergent discussion themes or
identifying popular topics [1]. Our work uses questions in the student
question answering forums and evaluates the semantic similarity between
pairs of questions to identify similar question clusters. The work
presented here can be used on the subset of discussion posts that have
been tagged or organized into a topic.


In addition, discussion forum data has been utilized for a wide variety
of purposes. A recent example is the analysis of the information seeking
behavior of students (which includes querying, refining the query,
reading and browsing) while they learn programming [8]. Sentiment
analysis in discussion forums [18] and studies examining the
relationship between students' discussion behaviors and their learning
[17], [6] explore further possibilities of using the forum as a rich
source of data.


2.2 Community Question Answering (CQA) 
The popularity of CQA systems indicates that users find them useful in
finding answers to their questions. However, there are several issues
related to CQA that have led to a large body of research: 1) Identifying
good and relevant answers to questions can help users filter noise in
the responses. 2) Identifying questions that may be repeated or closely
related to previously asked questions can help eliminate redundancy.
The latter issue relates very closely to the problem we address in our
work.


One of the recent tasks in SemEval 2016 [12] dealt with identifying and
ranking a set of 10 related questions given a new question. The
participating teams built supervised machine learning models that used
distributed representations of words and knowledge graphs to define
lexical and semantic features [5], as well as neural network approaches
including convolutional neural networks (CNN) and long short-term
memory (LSTM) networks [11], [16], [13]. The focus of their work is to
rank the questions by relevance, considering semantic similarity. A
prerequisite to using these approaches in practice is a labeled
dataset. In our work, we use an unsupervised method that circumvents
the need for labeled data.


Clustering question-answer (QA) pairs from CQA systems to ease tasks
such as tagging has been less explored. In a recent work [14], the
authors identify clusters of related QA pairs. The approach is based on
the classical k-means clustering algorithm, but mixes the similarities
of the questions and answers to define an objective function that is
optimized over multiple iterations. While our goal is also to cluster
questions using an unsupervised model, we do not rely on the answer
information, primarily because the answers given by peer students may
contain irrelevant information, especially with students from middle
school.


Figure 1: Identifying commonly asked questions. [Pipeline: questions ->
preprocessing (stopword removal, spell checking, lemmatization) ->
question-question distance metric (lexical similarity using dictionary,
word embedding similarity) -> hierarchical clustering (single linkage,
complete linkage) -> dendrogram -> clusters, given the number of
questions per cluster (range).]


3. IDENTIFYING COMMON QUESTIONS 


Our method to infer or identify commonly asked questions is organized
into multiple steps, as shown in Figure 1. The first step deals with
preprocessing the questions to remove noise. Next, we focus on the key
aspect of any clustering algorithm: the choice of (dis)similarity
function or distance metric between a question pair. The hierarchical
clustering algorithm uses the distance metric to derive its output as a
dendrogram. Finally, the dendrogram is partitioned and the clusters are
identified.


3.1 Preprocessing 

In the preprocessing phase, for each question we filter out all URLs,
email addresses and other similar patterns that may be irrelevant in
the context of the data being analyzed. Misspellings are corrected
using the WordNet database. Stopwords are removed and the remaining
words in each question are lemmatized to their base forms using the
lemmatizer provided by the Stanford CoreNLP parser⁵.
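The preprocessing steps above can be illustrated with a minimal sketch. It is not the pipeline used in this work: the stopword list and the lemma table are tiny hypothetical stand-ins for a full stopword list and the Stanford CoreNLP lemmatizer, the regular expressions are our own assumptions, and spelling correction is omitted.

```python
import re

# Hypothetical, deliberately tiny resources; a real system would use a full
# stopword list, a spell checker, and a proper lemmatizer.
STOPWORDS = {"what", "is", "the", "of", "a", "an", "do", "they", "have", "at"}
LEMMAS = {"lines": "line", "slopes": "slope", "graphing": "graph"}

# Assumed patterns for URLs and email addresses.
URL_OR_EMAIL = re.compile(r"(https?://\S+|www\.\S+|\S+@\S+\.\S+)")
TOKEN = re.compile(r"[a-z0-9']+")


def preprocess(question: str) -> list[str]:
    """Strip URLs/emails, lowercase, drop stopwords, and lemmatize tokens."""
    text = URL_OR_EMAIL.sub(" ", question.lower())
    tokens = TOKEN.findall(text)
    kept = [t for t in tokens if t not in STOPWORDS]
    # Map each remaining token to its base form when the table knows it.
    return [LEMMAS.get(t, t) for t in kept]
```

For example, `preprocess("What is the slope of vertical lines?")` keeps only the content words and maps `lines` to `line`.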


3.2 Question-Question Distance Metric 

The distance function combines lexical and word embedding similarity.
We define the distance metric between a question pair q_i, q_j as
follows:

$$dist(q_i, q_j) = \left( \Omega \cdot D_{bow}(q_i, q_j)^{x} + (1 - \Omega) \cdot D_{vec}(q_i, q_j)^{x} \right)^{1/x} \qquad (1)$$

where D_bow(q_i, q_j) is the distance computed based on lexical
similarity and D_vec(q_i, q_j) is the distance computed based on word
embeddings for the question pair (q_i, q_j). The following sections
describe these distances in detail. Ω is the weight associated with the
lexical or word embedding based distance. As stated by the authors in
[14], a metric of the form (a^x + b^x)^{1/x} approximates max{a, b} for
high positive values of x and min{a, b} for high negative values of x.
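The max/min behaviour noted above can be checked numerically. The sketch below assumes the combined distance is a weighted power mean of the two component distances, with weight Ω (`omega`) and exponent x; this is our reading of the description, and the function name and the sample values are hypothetical.

```python
def combined_distance(d_bow: float, d_vec: float, omega: float, x: float) -> float:
    """Weighted power-mean combination of two distances.

    Assumes 0 <= omega <= 1 and x != 0; negative x additionally requires
    strictly positive distances.
    """
    if x == 0:
        raise ValueError("x must be non-zero")
    return (omega * d_bow ** x + (1.0 - omega) * d_vec ** x) ** (1.0 / x)


# With equal weights, x = 1 gives the arithmetic mean, a large positive x
# approaches the larger distance, and a large negative x the smaller one.
mean_like = combined_distance(0.2, 0.8, omega=0.5, x=1)
max_like = combined_distance(0.2, 0.8, omega=0.5, x=50)
min_like = combined_distance(0.2, 0.8, omega=0.5, x=-50)
```

Under this form, a high positive x makes a question pair count as close only if both the lexical and the embedding distances are small, whereas a high negative x lets either signal suffice.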


5 http://stanfordnlp.github.io/CoreNLP/

3.2.1 Lexical Similarity

Each question is represented as a bag-of-words vector whose dimension
is the vocabulary size of the question corpus W. Each word w_i in the
question and its associated synonyms are identified from the WordNet
lexical database. The words are weighted by their idf measure, which is
given by

$$idf(w_i) = \log\left( \frac{|D|}{df(w_i)} \right) \qquad (2)$$

where |D| is the corpus size and df(w_i) is the number of documents
containing w_i. The similarity between two questions, Sim_bow(q_i, q_j),
is computed using the cosine similarity of the question vectors. The
distance is defined as:

$$D_{bow}(q_i, q_j) = 1 - Sim_{bow}(q_i, q_j) \qquad (3)$$
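A minimal sketch of the idf-weighted bag-of-words distance follows. The synonym table is a hypothetical stand-in for a WordNet lookup, and mapping synonyms onto one canonical word is our own simplification of how synonyms could enter the vector.

```python
import math
from collections import Counter

# Hypothetical synonym table standing in for a WordNet lookup.
SYNONYMS = {"gradient": "slope", "incline": "slope"}


def canonical(tokens):
    """Replace each token by its canonical synonym, if one is listed."""
    return [SYNONYMS.get(t, t) for t in tokens]


def idf_table(corpus):
    """idf(w) = log(|D| / df(w)) over a list of tokenized questions."""
    n_docs = len(corpus)
    df = Counter(w for doc in corpus for w in set(canonical(doc)))
    return {w: math.log(n_docs / count) for w, count in df.items()}


def bow_distance(q_i, q_j, idf):
    """1 - cosine similarity of idf-weighted bag-of-words vectors."""
    w_i = {w: c * idf.get(w, 0.0) for w, c in Counter(canonical(q_i)).items()}
    w_j = {w: c * idf.get(w, 0.0) for w, c in Counter(canonical(q_j)).items()}
    dot = sum(w_i[w] * w_j.get(w, 0.0) for w in w_i)
    norm_i = math.sqrt(sum(v * v for v in w_i.values()))
    norm_j = math.sqrt(sum(v * v for v in w_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 1.0  # no usable words: treat the pair as maximally distant
    return 1.0 - dot / (norm_i * norm_j)


corpus = [
    ["slope", "vertical", "line"],
    ["gradient", "vertical", "line"],
    ["limit", "sin", "rule"],
]
idf = idf_table(corpus)
```

With this toy corpus, the first two questions differ only by a synonym and end up at distance zero, while the first and third share no words and are at distance one.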


3.2.2 Word Embedding Similarity


Each question is represented as a weighted combination of the
embeddings of the words in the question. The word vector v_w for each
word w in the question is identified using the distributed
representation of words generated by the word2vec tool [10]. Each
question q is represented as:

$$v_q = \frac{1}{|q|} \sum_{w \in q} idf(w) \, v_w \qquad (4)$$

The similarity between two questions, Sim_vec(q_i, q_j), is computed
using the cosine similarity of the question vectors. The distance
between a question pair q_i, q_j is defined as:

$$D_{vec}(q_i, q_j) = 1 - Sim_{vec}(q_i, q_j) \qquad (5)$$
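The embedding-based distance can be sketched in the same spirit. The three-dimensional vectors below are made-up stand-ins for word2vec output, and the optional per-word weights (defaulting to 1) are an assumption about how the weighting could be supplied.

```python
import math

# Hypothetical 3-dimensional vectors standing in for word2vec output.
WORD_VECTORS = {
    "slope": [0.9, 0.1, 0.0],
    "gradient": [0.8, 0.2, 0.0],
    "line": [0.7, 0.0, 0.3],
    "limit": [0.0, 0.9, 0.4],
}


def question_vector(tokens, weights=None):
    """Weighted average of the vectors of in-vocabulary tokens."""
    known = [t for t in tokens if t in WORD_VECTORS]
    if not known:
        return None
    dim = len(next(iter(WORD_VECTORS.values())))
    total = [0.0] * dim
    for t in known:
        w = 1.0 if weights is None else weights.get(t, 1.0)
        for k, value in enumerate(WORD_VECTORS[t]):
            total[k] += w * value
    return [value / len(known) for value in total]


def vec_distance(q_i, q_j, weights=None):
    """1 - cosine similarity of the two question vectors."""
    v_i = question_vector(q_i, weights)
    v_j = question_vector(q_j, weights)
    if v_i is None or v_j is None:
        return 1.0  # no known words: treat the pair as maximally distant
    dot = sum(a * b for a, b in zip(v_i, v_j))
    norm = math.sqrt(sum(a * a for a in v_i)) * math.sqrt(sum(b * b for b in v_j))
    return 1.0 - dot / norm if norm else 1.0
```

Questions that use different but contextually similar words, such as `slope` and `gradient`, come out close here even without a synonym dictionary.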


3.3 Hierarchical Clustering

We use agglomerative hierarchical clustering. Initially, each question
is in its own cluster. The nearest clusters are merged until there is
only one cluster left. The end result is a cluster tree or dendrogram.
The tree can be cut at any level to produce different clusters. There
are two common linkage criteria. The single linkage approach merges two
clusters by considering the minimum distance between the points in the
clusters to be merged. In the complete linkage approach, two clusters
are merged by considering the maximum distance between the points in
the clusters. Complete linkage clustering results in more compact
clusters, as the merge criterion considers all points in the cluster.
We use complete linkage clustering. The worst case run time complexity
of agglomerative clustering is O(n^2 log n), which makes it too slow
for large datasets. The primary advantage of the clustering approach is
that it does not require any prior input to generate the cluster tree.
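The merge procedure described above can be sketched with a deliberately naive implementation of complete linkage. It recomputes all cluster distances in every round, so it is slower than the bound quoted above and is only meant to make the merge criterion concrete; the merge-history format is our own choice.

```python
def complete_linkage(dist):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Returns the merge history as (members_a, members_b, distance) tuples,
    ordered from the first (closest) merge to the last.
    """
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance of the farthest pair of points.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Toy distance matrix for four questions: 0 and 1 are close, 2 and 3 are close.
toy = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.85],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.85, 0.2, 0.0],
]
history = complete_linkage(toy)
```

Replacing `max` with `min` in the inner distance computation would turn the same loop into single linkage.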


We also evaluated another clustering algorithm, density-based spatial
clustering of applications with noise (DBSCAN) [3], which has a worst
case run time complexity of O(n^2). The inputs to DBSCAN are the
minimum number of points needed to form a cluster and the distance
threshold eps such that, for every point in the cluster, there exists
another point in the same cluster whose distance is less than eps.
Selecting the distance threshold as an input can be a challenge, as the
resulting clusters can vary significantly with eps.


Figure 2: (a) Dendrogram; (b) resulting clusters identified for an
input range of number of points.


3.4 Dendrogram 

The output of the hierarchical clustering is a dendrogram, as shown in
Figure 2(a). A typical approach is to cut the dendrogram at a specific
distance and identify the resultant clusters. However, a dendrogram can
be cut at different distances based on domain or application specific
information. In our scenario, an important input from the instructor is
the minimum number of points or questions in a cluster for it to be
considered a FAQ. An instructor may decide that she would like to
address groups of at least 4 similar questions, or provide a range of
cluster sizes as input. Figure 2(b) depicts such a scenario of wanting
a range of [8, 4] questions in each cluster. We use the number of
questions as the input and provide a list of question clusters sorted
by the cluster distance. Clusters that are linked at lower distance
values are of good quality; as the linkage distance increases, the
quality of the resulting clusters degrades.
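One possible way to extract clusters of a requested size range from a dendrogram is sketched below. It assumes the dendrogram is available as a merge history of (members_a, members_b, distance) tuples, which is a hypothetical format, and it reports only maximal qualifying clusters, which is our own tie-breaking choice rather than something specified above.

```python
def clusters_in_range(merges, min_size, max_size):
    """Pick merged clusters whose size lies in [min_size, max_size].

    `merges` is a list of (members_a, members_b, distance) tuples whose
    members are sets of question ids. Returns (members, distance) pairs
    sorted by increasing merge distance, i.e. best clusters first.
    """
    candidates = []
    for members_a, members_b, distance in merges:
        members = frozenset(members_a) | frozenset(members_b)
        if min_size <= len(members) <= max_size:
            candidates.append((members, distance))
    # Assumption: report only maximal clusters, dropping nested subsets.
    maximal = [
        (m, d) for m, d in candidates
        if not any(m < other for other, _ in candidates)
    ]
    return sorted(maximal, key=lambda pair: pair[1])


# Hypothetical merge history over five questions.
example_merges = [
    ({0}, {1}, 0.1),
    ({2}, {3}, 0.2),
    ({0, 1}, {4}, 0.3),
    ({0, 1, 4}, {2, 3}, 0.9),
]
faq = clusters_in_range(example_merges, min_size=2, max_size=3)
```

Because the output is sorted by merge distance, an instructor reading the list from the top sees the tightest clusters first.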


4. EXPERIMENTAL EVALUATION 


In this section, we evaluate our method for identifying FAQ. 
We use a labeled data set from a CQA archive and create 
reference clusters. 


4.1 Data 


To evaluate the suitability of our approach, we use the SemEval 2016
Task 3 dataset, which contains questions and answers from the Qatar
Living forum [12]. The data relevant for our evaluation contain
questions categorized as original questions. For each original
question, a set of 10 related questions is annotated as PerfectMatch,
Relevant or Irrelevant. Using the labeled information, we build a set
of reference clusters, or ground truth, each of which contains the
original question and the related questions that are either
PerfectMatch or Relevant. Table 2 contains the details of the data set.
The test dataset consisted of 770 questions.


Table 2: SemEval 2016 Task 3 dataset used.

Questions | Training | Test
Original Questions | 200 | 70
Related Questions: PerfectMatch | [illegible] | [illegible]
Related Questions: Relevant | 606 | [illegible]
Related Questions: Irrelevant | 1,212 | [illegible]
Related Questions: Total | 1,999 | 700
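The construction of reference clusters described above can be sketched as follows. The input format, a list of (original id, related id, label) triples, and the identifiers in the example are hypothetical and do not reflect the actual layout of the SemEval files.

```python
def reference_clusters(labels):
    """Build one reference cluster per original question.

    `labels` is an iterable of (original_id, related_id, label) triples.
    A related question joins the cluster of its original question when it
    is labeled PerfectMatch or Relevant.
    """
    clusters = {}
    for original_id, related_id, label in labels:
        cluster = clusters.setdefault(original_id, {original_id})
        if label in ("PerfectMatch", "Relevant"):
            cluster.add(related_id)
    return clusters


# Hypothetical annotations for two original questions.
annotations = [
    ("Q1", "Q1_R1", "PerfectMatch"),
    ("Q1", "Q1_R2", "Irrelevant"),
    ("Q1", "Q1_R3", "Relevant"),
    ("Q2", "Q2_R1", "Irrelevant"),
]
reference = reference_clusters(annotations)
```

An original question whose related questions are all Irrelevant still yields a singleton cluster in this sketch.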


4.2 Evaluation Metrics 

The quality of clustering is measured using F-Measure, combining the
precision and recall scores used in information retrieval [7]. Each
generated cluster C_gen is treated as the result of a query and each
reference cluster C_ref is considered as the desired set of documents
or points:

$$precision(C_{gen}, C_{ref}) = \frac{|C_{gen} \cap C_{ref}|}{|C_{gen}|} \qquad (6)$$

$$recall(C_{gen}, C_{ref}) = \frac{|C_{gen} \cap C_{ref}|}{|C_{ref}|} \qquad (7)$$

$$F\text{-}Measure(C_{gen}, C_{ref}) = \frac{2 \cdot precision \cdot recall}{precision + recall} \qquad (8)$$

The average precision, recall and F-Measure values are computed over
the clusters containing an "original question". For the purpose of
evaluation, we use the test data set and identify the partition, or
distance threshold, at which the maximum average F-Measure is obtained.
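The three measures can be computed for a single generated/reference cluster pair as in the sketch below. The function name and the question ids are hypothetical; averaging over clusters is left out.

```python
def cluster_scores(generated, reference):
    """Precision, recall and F-Measure of a generated cluster against a
    reference cluster, both given as collections of question ids."""
    generated, reference = set(generated), set(reference)
    overlap = len(generated & reference)
    precision = overlap / len(generated) if generated else 0.0
    recall = overlap / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Three of the four generated questions belong to the five-question reference.
p, r, f = cluster_scores({"q1", "q2", "q3", "q9"}, {"q1", "q2", "q3", "q4", "q5"})
```

In this example the generated cluster is mostly correct but incomplete, so precision (3/4) exceeds recall (3/5).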


4.3 Results 


The results of our approach are presented in Figure 3. We evaluate the
cluster measures by considering the question-question distance metric
using various values of Ω and x. High F-Measure and recall are achieved
when we use lexical similarity as the primary distance metric. Using
word embedding as the primary similarity metric results in higher
precision, which could be suitable in scenarios where the data is noisy
or contains a large number of irrelevant questions. Figure 3(a) shows
the effect of varying the weights associated with lexical and word
embedding based similarity. When Ω = 0.5, a balance between high
precision and high recall is achieved. Further, Figure 3(b) shows the
metrics achieved by varying x. Here, the best results are achieved with
x = 4, with an F-Measure of 0.653, a precision of 0.874 and a recall of
0.5609. The SemEval 2016 Task 3 participants reported unofficial
precision, recall and F-Measure values. Here, for each original
question, Relevant and PerfectMatch questions are categorized as true
pairs and Irrelevant questions are categorized as false pairs. The
precision values reported by the top 4 participants ranged from 0.636
to 0.763. The recall values were higher and ranged from 0.553 to 0.759.
The F-Measure was between 0.64 and 0.71. The results of our method are
comparable and encouraging, given that we have used an unsupervised
model.


5. INFERRING FAQ FROM STUDENT QA SYSTEM


In order to verify the relevance of the approach, we ran the clustering
tool on a student question answering platform. The dataset for the
analysis was extracted from the Khan Academy, by permission, using a
screen scraping protocol. We considered micro lectures of 8th grade
mathematics and micro lectures covering differential calculus. On the
learning platform, each micro lecture video has easy access to the page
where questions for that lecture can be asked or viewed. Asking
questions is voluntary. Each learner can view questions that have been
previously asked by their peers. Once a question is asked, a discussion
thread is initiated with peer students providing answers. The data set
contains about 22000 questions from 300 video lectures. As questions
are asked in the context of a given micro lecture, we infer the FAQs
for each lecture. This helps us reduce the running time of our
clustering algorithm.


Figure 3: F-Measure, Precision and Recall values by varying Ω and x.
(a) x = 1, varying Ω from [0, 1].


Table 3: Sample FAQ inferred using proposed method from Khan Academy question answering forum.

Video Lecture | Student Questions

Graphing a line in slope intercept form | what do the b stand for in the equation y = mx + b?

Introduction to limits | Sal said that 0/0 is undefined. Wouldn't it be not a number?
 | [...]
 | Is 0/0 undefined, or ones and Why?
 | [I] thought that 0/0 is called a indeterminant not undefined. Correct my logic please
 | WHY is anything divided by 0 considered as undefined?

Introduction to limits | I'm trying to understand but, I see what he is doing but what ever he is saying is in slow motion so I don't understand. And what is a piecewise function
 | Do you have a video where they give you a graph of a piecewise function, but need to find the Definition of function rule?
 | How to find inequalities for piecewise functions?
 | How do you graph piecewise functions?
 | what is a piecewise function?

Proof of sin x by x | I m a class 9 student and dont have 100% knowledge on trigonometry (just went through his videos once) so i dnt get what i am missing here: should he prove that for 3rd and 2nd quadrant as well?!
 | Is this statement is not applicable to 2nd & 3rd quadrants? Why?
 | exactly why does this only apply to 1st and 4th quadrant why not, 2nd and 3rd?
 | what about the 2nd and 3rd quadrants?
 | X would not be negative in the 4th quadrant. x is only negative in 2nd and 3rd quadrant.
 | why is he working in the first and fourth quadrants only? because the absolute value remains the same in all quadrants
 | Q14: Khan says that cos(x) is always the x value in the first and fourth quadrants. Doesn['t] he mean that cos(x) and x have the same sign in the first and fourth quadrants?
 | Why do we consider x only in the first and the fourth quadrant? Does it change the result if we need to consider all the quadrants?
 | [I] understand everything except going into the fourth quadrant.
 | [...] of the video, he is discussing the fourth quadrant.
 | Why go into the fourth quadrant, and why does he stay away from the second and third quadrant?
 | why is he working in the first and fourth quadrants only? because the absolute value remains the same in all quadrants


5.1 Discussion 

Table 3 presents a subset of the clusters or FAQs extracted. Four
example clusters are presented. We were able to extract 4 to 10
questions in each of the sample clusters. We observed several clusters
with irrelevant questions, which resulted from poor semantic matches
when the question content contained numerous mathematical expressions
and symbols and little text. Our results can improve with domain
specific preprocessing. The current preprocessing step does not parse
or process mathematical expressions. Identifying expressions and
tagging them as special tokens for computing the question-question
distance could provide better results. We also noticed several
abbreviations in the questions that were not handled by our
preprocessing step. In addition, many students had questions related to
content presented at specific times in the video lectures. Annotating
terms that represent video lecture timestamps as a part of
preprocessing could help ascertain the intervals within the lectures
where students are seeking more information. Such domain specific
processing of the content in questions could help improve the
question-question distance metric and reduce noise in the generated
clusters.
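The special-token idea discussed above could look roughly like the following sketch. It is not part of the evaluated system: the token names `<TIME>` and `<MATH>` and both regular expressions are our own rough heuristics, and the math pattern will also fire on some ordinary text such as hyphenated words.

```python
import re

# Video timestamps such as 3:45 or 12:05.
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}\b")
# Rough heuristic: operands (optionally with a parenthesized argument)
# joined by one or more arithmetic operators or '='.
MATH_EXPR = re.compile(
    r"\b\w+(?:\([^)]*\))?(?:\s*[=+\-*/^]\s*\w+(?:\([^)]*\))?)+"
)


def tag_special_tokens(text: str) -> str:
    """Replace timestamps and simple math expressions by placeholder tokens."""
    text = TIMESTAMP.sub("<TIME>", text)
    return MATH_EXPR.sub("<MATH>", text)


tagged = tag_special_tokens("At 3:45 why is y = mx + b used?")
```

Collapsing expressions to one token means that two questions about different formulas still match on the surrounding words, which may or may not be desirable for a given lecture.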


6. CONCLUSION 


Our goal in this work was to identify FAQ from the question answering
systems of online learning environments. We used agglomerative
clustering, an unsupervised learning approach, to identify the FAQ, as
it did not require any prior inputs to identify groups of questions. A
distance metric was defined to harness similarity based on bag of
words and word embeddings. Our empirical evaluation on a labeled
dataset shows the effectiveness of our approach, with precision and
F-Measure values comparable to those of existing methods that use
supervised models. We extracted questions asked by students from Khan
Academy, and FAQ were extracted for each topic. In future work, we plan
to include the answers provided by students in identifying similar
questions. The answers can be filtered based on the votes received,
student popularity and other related answers in the posts. This would
improve the quality of the extracted FAQ.


ABSTRACT


Massive Open Online Courses (MOOCs) are a promising 
form of online education. However, the occurrence of aca- 
demic dishonesty has been threatening MOOC certificates’ 
effectiveness as a serious tool for recruiters and employ- 
ers. Recently, a large-scale study on the log traces from 
more than one hundred MOOCs created by Harvard and 
MIT has identified a specific cheating strategy viable in 
MOOCs: Copying Answers using Multiple Existences On- 
line (CAMEO). In essence, learners create several accounts 
on a MOOC platform, request assessment solutions via some 
of the accounts, and then submit these “harvested” solutions 
in their main account to receive credit. In our work, we repli- 
cate the CAMEO implementation and apply it to ten edX 
MOOCs created by the Delft University of Technology. Our 
results show that in those MOOCs, 1.9% of certificates were 
likely earned through CAMEO cheating, a number compa- 
rable to the fraction of cheating observed in Harvard and 
MIT MOOCs. 


Keywords 
MOOCs, Academic Dishonesty, Multiple-Account Cheating, 
Educational Data Mining 


1. INTRODUCTION 


Cheating is generally defined as using dishonest means to gain an
undeserved reward for ability or to get out of an embarrassing
situation [3]. Academic dishonesty is a type of cheating that occurs in
relation to an academic exercise. It is a widespread occurrence across
different levels and forms of education [4]. Students adopt diverse
cheating strategies, such as impersonation, bringing notes into the
exam hall, using an unauthorized digital device, and so on.


MOOCs, which are courses designed with open access for a large number
of online participants, have become a vital part of scalable and
large-scale education. However, the effectiveness of MOOCs has been
threatened by academic dishonesty. For instance, as early as 2012, some
instructors voiced concerns about various forms of cheating in their
MOOCs [7].


One of the main challenges in studying cheating in MOOCs is the general
lack of ground truth data: MOOC providers may be reluctant to confront
learners (as definite proof of cheating is difficult to come by and
obtaining it is a time-consuming endeavour) and MOOC learners are
reluctant to admit their misbehaviour. Recently, Northcutt et al. [5]
proposed a first approach to automatically detect a particular kind of
cheating purely based on the log data that is collected in major MOOC
platforms; they termed this method CAMEO, or Copying Answers using
Multiple Existences Online. In brief, this method is able to detect
learners that cheat in the following way: (1) A learner registers
multiple accounts on a MOOC platform and enrolls in a MOOC of interest
with all these accounts; one of those registered accounts is the
learner's main account. (2) The learner uses some of the registered
accounts to randomly submit answers to assessment questions (which in
MOOCs are often multiple-choice or fill-in-the-blank questions to
enable automatic grading) as a way to harvest the correct solutions.
This is made possible by a design decision of major MOOC platforms,
which allow learners to check their submitted solutions immediately
after submission. (3) The learner then submits the harvested solutions
through the main account, allowing the learner to successfully complete
the course and earn a certificate. Commonly, achieving 60% (or a
similar percentage) of all possible points is sufficient to receive a
MOOC certificate.


Among the many potential ways of cheating in MOOCs, CAMEO is of
particular concern for a number of reasons: (1) the CAMEO cheating
strategy can be performed by every learner individually, as it does not
require learners to collaborate with others; (2) CAMEO cheating is
efficient and easy to execute, as it directly utilizes the solutions
provided in a MOOC; and (3) CAMEO cheating can be applied across many
different MOOCs, largely independent of the subject or course level.


Northcutt et al. [5] observed CAMEO cheating in 69 edX MOOCs (out of
115 investigated) provided by MIT and Harvard University; among those
69, approximately 1.3% of the certificates were issued to learners
identified as CAMEO users. Given that MOOCs provided by different
universities usually attract varying sets of learners, in this work we
investigate the following two Research Questions:


RQ1 What is the prevalence of CAMEO cheating in the 
MOOCs provided by TU Delft? 


RQ2 What are the characteristics of learners identified to have
employed the CAMEO strategy?


To answer these questions, we implement the detection approach as
described in [5] and apply it to the log traces of 10 edX MOOCs. We
find that 1.9% of the certificates are earned by CAMEO learners (our
answer to RQ1), with some types of MOOCs more prone to cheating than
others. While we did not observe any CAMEO behaviour in a MOOC on
political debates, we found more than 6% of certificates to be CAMEO
certificates in one business and one technical course, respectively.
With respect to RQ2, we observe cheating to be most prevalent
mid-course and to be more prevalent in some user demographics than
others.


2. RELATED WORK 


A few works have investigated the prevalence of cheating in MOOCs. Two
of the earliest are [5] and [6]. Both focused on the detection of
CAMEO cheating based on learners' traces in MOOCs provided by MIT on
edX.


In [5], 1.3% of the certificates among 69 MOOCs covering different
subjects were earned by learners who adopted CAMEO cheating strategies.
Learners who applied CAMEO were more likely to be young, male and
international than the other certified learners. In [6], the
corresponding figure is 10.3% of the certificates in an introductory
physics MOOC.


In both of these works, the researchers define patterns of CAMEO
behaviour and select learners whose behaviors satisfy the patterns.
There are overlaps between the criteria adopted by the two works.
Ruiperez-Valiente et al. [6] make more detailed assumptions about CAMEO
in different modes. Northcutt et al. [5] conducted their study on more
than 100 MOOCs, which helps to avoid accidental bias in the prevalence
of CAMEO caused by individual courses.


Compared to these works, our goal is to investigate the 
prevalence of this cheating behavior in the MOOCs pro- 
vided by TU Delft and what the common characteristics 
are among the detected cheaters. 


3. DETECTION METHOD 


In this section we recap the main assumptions that underpin the
approach of Northcutt et al. [5]. Note that these assumptions are
derived from intuitions about MOOC learners' (or more generally online
users') behaviours on the learning platform. Our implementation of the
approach matches the original paper's algorithmic formulation as
closely as possible.

• CAMEO users hold at least two accounts. Each CAMEO user (i.e., a
learner who cheats to gain an advantage in a MOOC) should use one or
more accounts to harvest solutions (so-called Harvester Account(s)) and
one main account to submit the correct solutions (i.e., the Master
Account) so as to earn the certificate. Initially, every possible pair
of user accounts enrolled in a particular MOOC is a candidate
Master/Harvester pair.


• CAMEO users harvest solutions before entering them into their Master
Account. In other words, for questions that learners cheat on, the
candidate Harvester Account's gathering of the solution should precede
the candidate Master Account's submission in time.


• CAMEO users quickly pass collected solutions from Harvester Accounts
to the Master Account. It is reasonable to assume that a cheater may be
logged into both the Harvester Account and the Master Account
simultaneously and, once the correct solutions are collected, may
immediately submit them through the Master Account. This assumption
requires the time difference between the correct submission from the
candidate Master Account and the request for solutions from the
candidate Harvester Account to be small.


• Master Accounts are certified, Harvester Accounts are not. Given that
Harvester Accounts are mainly used to gather correct solutions by
randomly submitting answers, more often than not the Harvester Accounts
do not reach the passing threshold of a MOOC. At the same time, the
Master Accounts should perform well in that respect and earn a
certificate.


• Master Account and Harvester Account are connected via IP addresses.
As noted before, a CAMEO user may be logged into multiple accounts
simultaneously, on one and the same device or on different devices in
the same location; thus, it is likely that the Master and Harvester
Accounts share a common logged IP address during the MOOC.


In the CAMEO approach, these intuitions are transformed 
into filtering rules (that filter the initially created account 
pairs) and only candidate Master/Harvester pairs that meet 
all of these criteria are considered to be CAMEO users, that 
is, learners who cheat through multiple account usage in a 
MOOC. Most of these rules contain ad-hoc parameters (e.g. 
the time limit between a Harvester and Master account sub- 
mission); we have followed the parameter settings described 
in [5] in our implementation. 
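The filtering rules above can be illustrated with a small sketch. It is not the implementation of [5] or of this study: the `PairStats` record is a hypothetical per-pair summary that would have to be derived from the log traces, and the two thresholds are illustrative parameters whose values are not taken from [5].

```python
from dataclasses import dataclass


@dataclass
class PairStats:
    """Hypothetical summary of one candidate Master/Harvester pair."""
    master: str
    harvester: str
    master_certified: bool
    harvester_certified: bool
    shared_ip: bool
    # Per question: seconds from the harvester obtaining the solution to the
    # master's correct submission (positive means the harvester came first).
    delays_seconds: list


def is_cameo_pair(pair, max_delay_seconds, min_questions):
    """Apply the filtering rules; both thresholds are illustrative."""
    if pair.master == pair.harvester:
        return False  # a pair needs two distinct accounts
    if not pair.master_certified or pair.harvester_certified:
        return False  # master certified, harvester not
    if not pair.shared_ip:
        return False  # accounts must be linked by a common IP address
    # Harvesting must precede the master's submission, and only by a little.
    quick = [d for d in pair.delays_seconds if 0 < d <= max_delay_seconds]
    return len(quick) >= min_questions


candidate = PairStats("acct_m", "acct_h", True, False, True, [30, 45, -10, 4000])
flagged = is_cameo_pair(candidate, max_delay_seconds=300, min_questions=2)
```

In this example two of the four questions were answered by the master shortly after the harvester saw the solution, so the pair passes with the thresholds chosen here.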


4. EXPERIMENT 


4.1 Dataset 

Our study is based on the log data generated during 10 edX 
MOOCs (eight different MOOCs of which two ran twice) 
which were provided by TU Delft between 2014 and 2016. 
The MOOCs cover various scientific areas including data sci- 
ence, programming paradigms, biotechnology, business and 
political science. An overview of the MOOCs, including the 
number of enrolled learners and the number of certificates 
earned, is shown in Table 1.

Table 1: Overview of the ten MOOCs included in this study. #Enrollments shows the number of user accounts that
registered for each MOOC and #Certificates lists the number of registered participants that achieved a certificate (the
passing threshold is 50% for Frame101x and 60% for all other MOOCs). Note that FP101x and EX101x are listed twice, as
they both ran in two different time periods.

Course Code   Course Title                      Session       #Enrollments  #Certificates
FP101x        Functional Programming            2014 Fall     37,940        1,356
CTB3365DWx    Drinking Water Treatment          2014 Fall     10,458        246
EX101x        Data Analysis                     2015 Spring   33,015        2,190
Frame101x     Framing: How Politicians Debate   2015 Spring   34,017        919
Calc001x      Pre-university Calculus           2015 Summer   27,857        358
EX101x        Data Analysis                     2015 Fall     21,041        1,156
IB01x         Industrial Biotechnology          2015 Fall     8,143         329
FP101x        Functional Programming            2015 Fall     20,936        1,143
RI101x        Responsible Innovation            2016 Spring   2,741         113
CTB3365STx    Urban Sewage Treatment            2016 Spring   9,566         361

Table 2: Overview of the detected CAMEO users and the
percentage of certificates gained by CAMEO users. The last
row shows the numbers across all ten MOOCs.

Course Code       #CAMEO Users   % CAMEO Certificates
FP101x (2014)     13             0.96%
CTB3365DWx        4              1.63%
EX101x (2015S)    27             1.23%
Frame101x         0              0
Calc001x          13             3.63%
EX101x (2015F)    20             1.73%
IB01x             12             3.65%
FP101x (2015)     16             1.40%
RI101x            7              6.19%
CTB3365STx        25             6.93%
Total             137            1.89%


4.2 CAMEO Detection Results

For each of the MOOCs, we present the number of detected
CAMEO users (and the resulting percentage of certificates
gained through CAMEO) in Table 2. CAMEO users
are detected in 9 out of the 10 MOOCs and overall account
for 137 (or 1.89%) of all certificates. This percentage is
slightly higher than the 1.8% reported by Northcutt et al. [5].
The percentages vary across courses, with Urban Sewage
Treatment being the MOOC with the largest percentage of
CAMEO learners, nearly 7%. In contrast, Framing: How
Politicians Debate is the only MOOC in which no CAMEO
cheating was detected. In future work we will investigate this
variance in CAMEO between courses; we hypothesize that for
participants in Frame101x a certificate has less intrinsic value
(the self-development aspect is more important) and thus
cheating is less likely to occur.


4.3 Verification of CAMEO Users


To explore how plausible the detection results are (i.e., do
the detected account pairs actually belong to the same
learner, and did the learner indeed cheat), we manually
verified key account characteristics. It is sensible, for instance,
to assume that at least some CAMEO users register with
the same or a similar name across the Harvester and Master
Account. Indeed, among our 137 detected CAMEO users, 20%
have similar or even identical registered full names attached
to their Harvester and Master Accounts¹. To provide the
reader with some intuition on the similarities, we now
describe, for a randomly picked CAMEO user in our dataset,
the similarities between the detected Master and Harvester
Account:


• The Harvester & Master Account have the same registered
full name.

• The registered email addresses of the Harvester & Master
Account contain a common long character sequence (eight
characters).

• The Harvester & Master Account utilize the same IP
address to answer every question.

• The Harvester & Master Account submit answers within
60 seconds for every harvested question and the Harvester
Account always submits before the Master Account.

• The Harvester Account submits answers for all questions
in the course, but the correctness is only 11.5%.


Based on these observations, we are highly confident that 
the learner is indeed a CAMEO user. 
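
Some of the manual checks listed above can be approximated programmatically. The sketch below is illustrative only: the function names, the sample strings and the 60-second default are assumptions, and Python's difflib.SequenceMatcher is used as a stand-in whose matching is similar to, but not guaranteed to be identical with, the Ratcliff/Obershelp method used for the name comparison.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] between two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def longest_common_run(a: str, b: str) -> str:
    """Longest contiguous character sequence shared by two strings."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def harvester_always_first(pairs, max_gap_seconds=60):
    """pairs: (harvester_time, master_time) in seconds, one per question.

    True if the Harvester always submits first and the Master follows
    within max_gap_seconds (hypothetical default taken from the text).
    """
    return all(0 <= m - h <= max_gap_seconds for h, m in pairs)


# Hypothetical account data, for illustration only.
print(name_similarity("Jane Doe", "jane doe"))
print(longest_common_run("janedoe1990", "xjanedoe77"))
print(harvester_always_first([(100, 130), (500, 545)]))
```

Each function covers one observation from the list; how such signals would be combined into a final judgement is left open here, as in the manual verification above.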


4.4 Characteristics of CAMEO Users 

To gain a better understanding of the detected CAMEO 
users, we analyze their characteristics and patterns. With 
respect to the nationality of the certified learners, we find 
them to come mainly from the US, the Netherlands and the 
UK. However, the detected CAMEO users are mainly from 
India (27), the US (12) and Germany (7). 


We are also interested in the motivation of CAMEO cheaters,
i.e., what drives them to cheat in MOOCs. Intuitively, we
believe most CAMEO users to be strongly goal-oriented,
with the goal being the certificate (rather than knowledge
gain). To verify this intuition,
we compute how many detected CAMEO users would be


¹ We compute the similarity between two account names
according to the Ratcliff/Obershelp sequence match method
[1].

Table 3: Overview of the identified CAMEO learners and
their certificate status (pass or fail) if the assessment points
they gained through CAMEO were removed.

Course Code       Pass w/o CAMEO   Fail w/o CAMEO
FP101x (2014)     2                11
CTB3365DWx        0                4
EX101x (2015S)    3                24
Frame101x         0                0
Calc001x          0                13
EX101x (2015F)    4                16
IB01x             0                12
FP101x (2015)     2                14
RI101x            1                6
CTB3365STx        0                25
Total             12               125


able to earn a certificate without CAMEO cheating.
Specifically, we calculate the grades of CAMEO users under the
condition that they only receive credit for questions they did
not cheat on and evaluate whether the resulting scores are
sufficient to pass the course. As shown in Table 3, about 91%
of the CAMEO users (125 of 137) cannot pass the MOOCs
without cheating, which suggests that most of the CAMEO
users are purely certificate-driven.
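
The re-grading described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes every question is worth one point, and the function name, its arguments and the sample data are hypothetical.

```python
def passes_without_cameo(correct, harvested, total_questions, threshold=0.60):
    """Return True if the learner still passes once harvested points are removed.

    correct:         set of question ids the Master account answered correctly
    harvested:       set of question ids flagged as answered via CAMEO
    total_questions: number of graded questions (assumed one point each)
    threshold:       passing threshold as a fraction of all points
    """
    legitimate = correct - harvested
    return len(legitimate) / total_questions >= threshold


# Hypothetical learner: 70 of 100 questions correct, 30 of them harvested.
print(passes_without_cameo(set(range(70)), set(range(30)), 100))

# Share of detected CAMEO users who fail without cheating (Table 3 totals).
print(round(100 * 125 / 137, 1))
```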


We also investigate when CAMEO users are most likely to 
cheat during the course of a MOOC. To this end, we select 
FP101x (2014 and 2015) and EX101x (2015 Spring and 2015 
Fall) for analysis as the grading strategies adopted across the 
four MOOCs are very similar: almost all questions (more 
than 100 per course) are worth a single point and the final 
grade is simply based on the fraction of questions the learner 
answered correctly (with 60% of correct answers being the 
passing threshold). Figures 1 (FP101x) and 2 (EX101x) 
show the number of identified CAMEO users that resort to 
the CAMEO strategy across the different course weeks.

Figure 1: Average number of CAMEO cheaters per question
in each course week of FP101x.

Figure 2: Average number of CAMEO cheaters per question
in each course week of EX101x.

Few learners resort to CAMEO in the first two weeks of the
course, while course weeks 3, 4, 5 and 6 attract the most
cheating. This is not overly surprising considering that
the questions in later weeks are usually more difficult
than those in early weeks. The decrease in CAMEO
in the final week(s) can be explained by the fact that the
edX platform provides a Progress page where each learner
can check their progress towards the passing threshold. For
a learner whose main goal is the certificate, the realization
of that goal (which can occur as early as week 5, as
the passing threshold is 60%) is likely to reduce or stop their
CAMEO behaviour.
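
A per-week statistic of the kind named in the captions of Figures 1 and 2 could be computed along the following lines. The exact aggregation behind the figures is not spelled out in the text, so the definition used here (distinct cheaters per question, averaged over the questions of a week) and all names and sample data are assumptions.

```python
from collections import defaultdict


def avg_cheaters_per_question(events, question_week):
    """Average number of distinct CAMEO users per question, by course week.

    events:        iterable of (user_id, question_id) harvested submissions
    question_week: dict mapping every graded question id to its course week
    """
    cheaters = defaultdict(set)
    for user, question in events:
        cheaters[question].add(user)

    totals = defaultdict(int)
    counts = defaultdict(int)
    for question, week in question_week.items():
        totals[week] += len(cheaters.get(question, ()))
        counts[week] += 1
    return {week: totals[week] / counts[week] for week in sorted(counts)}


# Hypothetical data: two questions in week 1, one question in week 2.
weeks = {"q1": 1, "q2": 1, "q3": 2}
events = [("u1", "q1"), ("u2", "q1"), ("u1", "q1"),
          ("u1", "q3"), ("u2", "q3"), ("u3", "q3")]
print(avg_cheaters_per_question(events, weeks))
```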


5. CONCLUSION 


We successfully replicated the CAMEO strategy formalized 
in [5] and applied it to a novel set of MOOCs. Overall, we 
found similar percentages of CAMEO cheating in TU Delft 
MOOCs (1.9% vs. 1.3%), albeit with the limitation that we 
only explored 10 MOOCs (vs. 115 by MIT/Harvard). We 
are currently enlarging the study to include all 50 MOOCs 
that are provided by TU Delft. Our future work will place 
a greater emphasis on the demographic analysis of CAMEO 
users and on ways to reduce and prevent such cheating,
either through technological means or through ethical
appeals and moral reminders [2].

ABSTRACT

One of the most challenging tasks in the field of Educational
Data Mining (EDM) is to cluster students directly based on
their sequential, moment-to-moment system-student
interaction trajectories. The objective of this study is to build
a general temporal clustering framework that captures the
distinct characteristics of students’ sequential behavior
patterns, tracks whether a student’s learning experience is
unprofitable, and identifies such a student as early as
possible so that personalized learning can be offered. The central
idea of our framework is based on Dynamic Time Warping
(DTW), which calculates the distance between any two
temporal sequences, even ones of different lengths. In this paper,
we explore both the original DTW and our proposed normalized
DTW to generate a distance matrix and apply Hierarchical
Clustering to the resulting distance matrix. To fully evaluate
the power of our temporal sequential clustering framework,
we calculate the distance matrix at three types of granularity,
from fine to coarse: problem, level, and session, across
three training datasets. As expected, results show that
clustering moment-to-moment temporal sequences at problem
granularity is more effective than at level or session
granularity. In addition, our proposed normalized DTW is more
effective than both the original DTW and the baseline Euclidean
distance.


Keywords 


Clustering, distance matrix, dynamic time warping 


1. INTRODUCTION 


The impetus for the development of many Intelligent Tutoring
Systems (ITSs) was the desire to capture the effective
learning experience provided by human one-on-one
instruction. ITSs have shown a positive impact on learning, but the
degree of their effectiveness often depends on the individual
student’s motivation, incoming competence, etc. In ITSs, the
system-student interactions can be viewed as a sequential
action-response process in which each interaction affects
the subsequent ones. Because one of the great promises of
ITSs is to support personalized learning [15], the
system-student moment-to-moment interactive
trajectories often have vastly different lengths, while most
existing clustering approaches, including K-means and
Hierarchical Clustering, are not designed to directly handle such
temporal sequential datasets. Therefore, the main objective
of this research is to build and evaluate a general clustering
framework that captures the distinct characteristics of
system-student sequential interactive behavioral patterns,
tracks whether a student’s learning experience is
unprofitable, and identifies such a student as early as
possible so that personalized learning can be offered.


Previously, various clustering methods have been widely
applied in different Educational Data Mining (EDM)
applications such as temporally coherent clustering [7],
collaborative learning [9], reading comprehension [13], handwritten
coursework [4], and personalized e-learning [8]. However, as
far as we know, most of the prior research has used either
datasets that consist of per-student feature vectors that summarize
a student’s entire interaction trajectory but do not consider
the sequential nature of the interactions, or sequential data
where the student’s behavior is extracted as a sequence of
feature vectors but the length of the sequence is fixed.
Neither approach directly handles the moment-to-moment
temporal dependency and the different lengths of interactive
trajectories. Therefore, we implement Dynamic Time Warping
(DTW) [11], which calculates the distance between any two
sequences of different lengths and also considers
moment-to-moment dependencies.


We proposed a general temporal clustering framework that
first constructs a specified distance matrix on the
sequential dataset and then applies a clustering approach to the
resulting distance matrix. We tested our framework across
three datasets collected in the Fall 2015, Spring 2016 and Fall
2016 semesters. All participants were trained on a logic
tutor named Deep Thought (DT) and were assigned to
different conditions that differed in how the tutor decided whether
to present the next problem as Problem Solving or as a
Worked Example. Two to three weeks after the training, all
participants took an in-class midterm as the PostTest. Much to
our surprise, empirical results showed no significant
difference among the conditions on PostTest scores in any of
the three semesters. We therefore explored whether our proposed
general temporal clustering framework would generate
effective clusters to predict student PostTest scores. To do so,
we explored three types of granularity in increasing order
of problem, level and session. More specifically, a session
contained a student’s entire training session on the tutor,
which involved six levels, and each level contained multiple
training problems. The three types of granularity were as
follows: problem granularity recorded students’ problem-by-problem
behaviors and thus had different lengths for different students,
since the number of problems that students solved on DT
varied greatly, from 19 to 65; level granularity contained
sequential data with a fixed length of six, one per level, for
each student; and session granularity had one single
summarized feature vector for each student. In our case, we treated
session granularity as the baseline for early detection and
investigated the impact of different types of granularity on
clustering results.


In this work, we applied three distance functions, namely
DTW, normalized DTW and Euclidean distance, and
implemented Hierarchical Clustering with four different linkage
functions. Finally, we evaluated the goodness of the clusters
on PostTest scores. Our results showed that significant
differences were consistently found among the discovered clusters
when clustering student trajectories at problem granularity
rather than at level or session granularity, and that the best
result was found when using the first four out of six levels of
the trajectories rather than the entire trajectories. This
suggests that fine-grained problem granularity is
more suitable for clustering student interactive trajectories
than coarse-grained level and session granularity.


2. RELATED WORK 


2.1 Previous Research on Clustering 

Previous research has shown the value of clustering for
various applications in EDM. For example, clustering has been
widely used in student modeling. Yue Gong et al. [3]
applied k-means to identify clusters of students with distinct
skills and then applied a knowledge tracing model to
model the students in each cluster separately in order to
detect students’ knowledge level. They found that clustering
had a positive impact on student modeling, providing a good
representation of student knowledge. Furthermore, Terry
Peckham and Gord McCalla [13] utilized k-means in reading
comprehension tasks and determined four different clusters
based upon cognition skills including positive or negative
reading, scanning or scrolling behaviors.


Relatively little research has been done to directly cluster student
trajectories. Generally speaking, most of the prior research
used either per-student feature vectors or sequential data
of fixed length for this task. For the former case, Ke Niu
et al. [12] extracted a feature vector per learner by
analyzing his/her behavior and then applied a spectral
clustering algorithm to classify students’ performance in order to
benefit personalized services. They categorized
students’ performance into nine classes and evaluated the
clustering results based on accuracy. Similarly, Gholam
Montazer [10] proposed a hybrid clustering method to group
learners in e-learning systems and evaluated the clustering results by
comparing the clustering labels with the ground truth labels.


For sequential data of fixed length, Severin Klingler
et al. [7] designed a pipeline for evolutionary clustering
of student behavior sequences in order
to group students at any time point and to identify the
change of clusters over time. In particular, a Markov Chain
model is applied to transform the original behavior data as
well as to capture the moment-to-moment temporal
dependency. The optimal number of clusters is selected based
upon the best model, as evaluated by the Akaike information
criterion (AIC). In contrast to this work, we cluster
sequences of different lengths.


2.2 Application of DTW


DTW has been successfully applied to a variety of
applications related to time series data, such as time series
indexing [6], classification [14] and clustering in the domains of
astronomy, speech physiology, and medicine [1]. More
specifically, Hesam Izakian et al. [5] applied fuzzy clustering with
a DTW distance approach to UCR time series data sets and
evaluated the performance of the clustering methods based on
precision. In addition, Pierre Gançarski et al. [2]
utilized DTW to capture the semantic proximity between urban
blocks in spatial temporal topographic databases and
implemented ascendant Hierarchical Clustering to detect the
distinctive evolutions of urban blocks. Furthermore, Nurjahan
Begum et al. [1] explored DTW with added pruning
strategies and performed multidimensional time series clustering on
different types of data sets in the astronomy, speech
physiology, medicine and entomology domains. They
evaluated the performance of the clustering approaches in terms of
accuracy.


As far as we know, this is the first study to apply DTW
in the field of EDM by directly clustering student-system
interactive sequential trajectories. Given the special nature
of EDM, we further propose normalized DTW and find that
normalized DTW is more effective for our task than the original
DTW.


3. METHODOLOGY 


In this section, we first introduce the original and the pro- 
posed normalized DTW for calculating the distance matrix 
between any pair of student interactive trajectories, and 
then describe how we apply Hierarchical Clustering to iden- 
tify clusters with distinctive behavior pattern and perfor- 
mance. 


3.1 Distance Function 
3.1.1 Dynamic Time Warping (DTW) 


Given sequences $X = \{x_1, x_2, \ldots, x_N\}$ and $Y = \{y_1, y_2, \ldots, y_M\}$
of possibly different lengths ($N \neq M$), a warping path $W$ is an
alignment between $X$ and $Y$ in which an element of one
sequence may be mapped to several elements of the other
(one-to-many mapping). The cost of a warping path is
the sum of the costs of its mapped pairs. Furthermore,
a warping path must satisfy three constraints: 1) Endpoint
constraint: the alignment starts at pair $(1, 1)$ and ends at
pair $(N, M)$; 2) Monotonicity constraint: the order of
elements in the path for both $X$ and $Y$ preserves their
original order in $X$ and $Y$ respectively; 3) Step
size constraint: the indices of both $X$ and $Y$ differ by
no more than one step between two adjacent pairs in the path.
In other words, pair $(x_i, y_j)$ can be followed
by three possible pairs: $(x_{i+1}, y_j)$, $(x_i, y_{j+1})$ and
$(x_{i+1}, y_{j+1})$.
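
The three constraints above can be checked mechanically for a candidate alignment. The following sketch is an illustration under assumed conventions (1-based index pairs, a hypothetical function name), not part of the original framework.

```python
def is_valid_warping_path(path, n, m):
    """Check the endpoint, monotonicity and step size constraints.

    path: list of 1-based (i, j) index pairs aligning x_i with y_j
    n, m: lengths of the sequences X and Y
    """
    # Endpoint constraint: start at (1, 1) and end at (n, m).
    if not path or path[0] != (1, 1) or path[-1] != (n, m):
        return False
    # Monotonicity and step size: each move advances i, j, or both by one.
    allowed_steps = {(1, 0), (0, 1), (1, 1)}
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        if (i1 - i0, j1 - j0) not in allowed_steps:
            return False
    return True


print(is_valid_warping_path([(1, 1), (2, 1), (3, 2), (3, 3)], 3, 3))
print(is_valid_warping_path([(1, 1), (3, 2), (3, 3)], 3, 3))
```

Restricting every move to one of the three allowed steps enforces monotonicity and the step size limit at the same time.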


Dynamic Time Warping (DTW) is a distance measure that
searches for the optimal warping path between two series. In
particular, we first construct a cost matrix $C$ where each
element $C(i, j)$ is the cost of the pair $(x_i, y_j)$, specified
using the Euclidean, Manhattan or another distance function. DTW
is calculated by dynamic programming. The initial step of
the DTW algorithm is defined as

$$\mathrm{DTW}(i, j) = \begin{cases} \infty & \text{if } (i = 0 \text{ or } j = 0) \text{ and } i \neq j \\ 0 & \text{if } i = j = 0 \end{cases}$$

The recursive function of DTW is defined as

$$\mathrm{DTW}(i, j) = \min \begin{cases} \mathrm{DTW}(i-1, j) + w_h \cdot C(i, j) \\ \mathrm{DTW}(i, j-1) + w_v \cdot C(i, j) \\ \mathrm{DTW}(i-1, j-1) + w_d \cdot C(i, j) \end{cases}$$

where $w_h$, $w_v$, $w_d$ are the weights for the horizontal, vertical and
diagonal directions respectively. $\mathrm{DTW}(i, j)$ denotes the distance or
cost between the two subsequences $\{x_1, \ldots, x_i\}$ and $\{y_1, \ldots, y_j\}$,
and $\mathrm{DTW}(N, M)$ is the total cost of the optimal warping
path.


In the equally weighted case $(w_h, w_v, w_d) = (1, 1, 1)$, the
recursive function prefers the diagonal alignment
direction because a diagonal alignment incurs a one-step cost
while the combination of a vertical and a horizontal
alignment incurs a two-step cost. In order to counterbalance this
preference, we can set $(w_h, w_v, w_d) = (1, 1, 2)$.
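
The recursion and the two weight settings can be illustrated with a short dynamic-programming sketch. It assumes feature vectors given as tuples and Euclidean pair cost; the function name and the toy sequences are hypothetical, and this is not the authors' code.

```python
import math


def dtw(x, y, weights=(1.0, 1.0, 1.0)):
    """Weighted DTW cost between two sequences of feature vectors.

    x, y:    sequences of equal-dimension tuples, possibly of different length
    weights: (w_h, w_v, w_d) for the three terms of the recursion
    """
    w_h, w_v, w_d = weights
    n, m = len(x), len(y)
    # D[i][j] holds DTW(i, j); row 0 and column 0 are infinite except D[0][0].
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = math.dist(x[i - 1], y[j - 1])  # Euclidean pair cost C(i, j)
            D[i][j] = min(D[i - 1][j] + w_h * c,
                          D[i][j - 1] + w_v * c,
                          D[i - 1][j - 1] + w_d * c)
    return D[n][m]


a = [(0.0,), (2.0,)]
b = [(1.0,)]
print(dtw(a, b))                     # equal weights
print(dtw(a, b, weights=(1, 1, 2)))  # diagonal steps count double
```

With the toy sequences, the path must take one diagonal and one non-diagonal step, each of pair cost 1, so doubling the diagonal weight raises the total from 2 to 3.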


3.1.2 Normalized DTW


One potential issue with the original DTW definition is
that the longer the two sequences are, the larger their DTW
value will be, so its absolute value may not truly reflect
the difference between the two sequences. We therefore propose the
normalized DTW, defined as the original DTW divided by the
sum of the lengths of the two sequences:

$$\mathrm{DTW}_{\mathrm{norm}}(X, Y) = \frac{\mathrm{DTW}(N, M)}{N + M}$$


Each alignment in the warping path has a corresponding
weight, selected from $(w_h, w_v, w_d)$, and with the weights
$(1, 1, 2)$ the sum of the weights over all alignments equals the
sum of the lengths of the two sequences
($N + M$). Therefore, the normalized DTW evaluates the
average distance of the alignments in the warping path for two
sequences. We will empirically compare the effectiveness of
the original DTW and our proposed normalized DTW.
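
The normalization can be illustrated with a small self-contained sketch. It assumes scalar sequences with absolute difference as pair cost and the weights (1, 1, 2); the helper names are hypothetical. Alongside the cost, the code tracks the sum of step weights along the chosen path, which comes out as N + M.

```python
import math


def dtw_with_weight_sum(x, y, weights=(1, 1, 2)):
    """Return (DTW cost, sum of step weights along the chosen path)."""
    w_h, w_v, w_d = weights
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    W = [[0] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = abs(x[i - 1] - y[j - 1])
            # Each candidate carries its accumulated cost and weight sum.
            D[i][j], W[i][j] = min(
                (D[i - 1][j] + w_h * c, W[i - 1][j] + w_h),
                (D[i][j - 1] + w_v * c, W[i][j - 1] + w_v),
                (D[i - 1][j - 1] + w_d * c, W[i - 1][j - 1] + w_d),
            )
    return D[n][m], W[n][m]


def normalized_dtw(x, y):
    """Original DTW cost divided by the sum of the sequence lengths."""
    cost, _ = dtw_with_weight_sum(x, y)
    return cost / (len(x) + len(y))


short = dtw_with_weight_sum([0, 0], [1, 1])
long_ = dtw_with_weight_sum([0, 0, 0, 0], [1, 1, 1, 1])
print(short, long_)
print(normalized_dtw([0, 0], [1, 1]), normalized_dtw([0, 0, 0, 0], [1, 1, 1, 1]))
```

In this toy case every pair has cost 1, so the raw DTW value doubles when both sequences are doubled in length, while the normalized value stays the same.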


3.2 Hierarchical Clustering 

Our proposed framework uses Hierarchical Clustering because
K-means cannot be directly applied here: K-means needs
to calculate the centroid of each cluster, while we only have
the DTW-based distance for each pair of trajectories.


To apply Hierarchical Clustering, we explore four linkage
functions: average, median, complete and ward, which
determine how to merge clusters based on the distance between
the clusters. Our results show that the first three linkage
methods generate extremely unbalanced clusters while the
ward linkage discovers relatively balanced ones. Therefore,
in the following, we report our results using ward linkage
only.

The optimal number of clusters is selected based on the
measurement called WCSS (within-cluster sum of squares)
[16], defined as

$$W_K = \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} d(x_i, x_{i'})$$

where $C(i)$ denotes the cluster assigned to trajectory $i$ and
$d$ is the pairwise distance.
Our results show that the optimal number of clusters is 4.
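
The clustering step of this section can be sketched in plain Python from nothing but a pairwise distance matrix. The sketch below is an assumption-laden illustration, not the authors' implementation: it realizes ward linkage through the Lance-Williams update on squared distances, evaluates the within-cluster sum as the plain double sum of pairwise distances, and uses hypothetical function names and toy data.

```python
def ward_clusters(dist, k):
    """Agglomerative clustering with ward linkage on a distance matrix.

    dist: symmetric list-of-lists of pairwise distances
    k:    number of clusters to stop at (k >= 1)
    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}
    # Squared distances between active clusters, keyed by (smaller id, larger id).
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(members) > k:
        (a, b), d_ab = min(d2.items(), key=lambda item: item[1])
        n_a, n_b = len(members[a]), len(members[b])
        merged = members.pop(a) + members.pop(b)
        updated = {}
        for c in members:
            n_c = len(members[c])
            d_ac = d2.pop((min(a, c), max(a, c)))
            d_bc = d2.pop((min(b, c), max(b, c)))
            # Lance-Williams update for ward linkage.
            updated[(c, next_id)] = ((n_a + n_c) * d_ac + (n_b + n_c) * d_bc
                                     - n_c * d_ab) / (n_a + n_b + n_c)
        del d2[(a, b)]
        d2.update(updated)
        members[next_id] = merged
        next_id += 1
    return sorted(sorted(cluster) for cluster in members.values())


def wcss(dist, clusters):
    """Sum of pairwise distances over all ordered pairs within each cluster."""
    return sum(dist[i][j] for cluster in clusters for i in cluster for j in cluster)


# Toy distance matrix: four items on a line at positions 0, 1, 10, 11.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
for k in range(1, 5):
    clusters = ward_clusters(dist, k)
    print(k, clusters, wcss(dist, clusters))
```

On the toy data the within-cluster sum drops sharply from one to two clusters and only slightly afterwards, which is the kind of bend typically used to pick the number of clusters.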


4. EXPERIMENT 


4.1 Training Datasets 

Our datasets were collected by training students on the logic
tutor DT across three semesters: 2015 Fall, 2016 Spring and
2016 Fall, referred to as DT15F, DT16S and DT16F
respectively. In each semester, students were randomly assigned
to different conditions based on the pedagogical strategies
employed by the tutor. Pedagogical strategies were policies
used to decide whether to give Problem Solving (PS) or a Worked
Example (WE) as the next problem. In WE, students were
given a detailed example showing the expert solution for the
problem. In PS, by contrast, students were tasked with
solving a particular problem. For the different versions of DT, we
applied different types of data-driven approaches to induce
pedagogical strategies [15]. There were a total of four, six
and five conditions for DT15F, DT16S and DT16F
respectively. One-way ANOVA results showed that there was no
significant difference in PostTest scores among conditions
in any of the three semesters: F(158,1) = 0.728, p = 0.537
for DT15F, F(196,1) = 0.644, p = 0.667 for DT16S and
F(188,1) = 0.445, p = 0.776 for DT16F. Further details are
omitted for lack of space. While no significant
difference was found among the conditions, different
pedagogical policies resulted in quite different student-system
interactive trajectories, and our goal was to investigate whether
the proposed temporal clustering framework would be more
effective than the assigned condition at predicting PostTest
scores and at discovering the true temporal patterns during
student training.


To best describe student learning trajectory, we considered 
the following 36 continuous features which could be grouped 
into three categories: 


1 Autonomy (AM): the amount of work done by the
student, such as the number of problems solved so far
(PSCount) or the number of hints requested (hintCount).


2 Temporal Situation (TS): the time-related information
about the work process, such as the average time taken
per problem (avgTime) or the total time for solving a
problem (TotalPSTime).

3 Student Action (SA): the statistical measurement of the
student’s behavior, such as the number of non-empty-click
actions that students take (actionCount) or the number
of clicks for applying rules in a logic proof (AppCount).


To fully evaluate our proposed framework, we explored three
types of granularity: 1) Problem granularity considered
students’ behaviors problem by problem. When training
on DT, the number of problems that each student solved
differed greatly and, as a result, the length of student
interactive sequences varied. For example, about 8%, 4% and
1% of students had more than 40 problems in their
interactive sequences in DT15F, DT16S and DT16F respectively.
2) Level granularity summarized students’ behaviors for
each level as a single feature vector; since DT has six levels,
the length of the level interactive sequence is six for each
student. 3) Session granularity summarized the student’s
entire training behavior in a single feature vector.


Furthermore, there were 158, 196 and 188 students that
participated in DT15F, DT16S and DT16F respectively.
Combining the three semesters with the three types of granularity,
we had a total of 9 data sets.


Hierarchical Clustering with Ward Linkage

                Problem                          Level                                             Session
DT     Level    Normalized DTW   DTW             Normalized DTW   DTW             Euclidean       Euclidean
DT15F  3        5.05(.027)*      6.48(.012)*     3.84(.051).      1.40(.238)      1.30(.256)      2.26(.135)
DT15F  4        10.3(.001)**     1.18(.279)      5.86(.017)*      7.67(.006)**    0.75(.388)      2.96(.087).
DT15F  5        6.06(.015)*      1.71(.193)      4.03(.046)*      2.55(…)         1.93(.166)      1.81(.181)
DT15F  6        3.19(.076).      1.37(.244)      3.79(.053).      0.50(.480)      1.21(.272)      0.76(.385)
DT16S  3        12.4(.000)***    0.63(.427)      8.94(.003)**     1.89(.171)      0.41(.521)      1.00(.318)
DT16S  4        13.7(.000)***    0(.995)         10.0(.002)**     2.49(.117)      0.99(.319)      0.38(.536)
DT16S  5        7.1(.008)**      0.84(.359)      6.36(.013)*      3.75(.054).     0.39(.532)      15.8(.000)***
DT16S  6        3.11(.079).      0.05(.821)      0.53(.466)       0.67(.412)      0.06(.806)      8.33(.004)**
DT16F  3        0.28(.594)       0.96(.328)      0.94(.333)       2.38(.124)      0.89(.344)      2.90(.090).
DT16F  4        3.93(.049)*      1.97(.163)      2.61(.108)       3.32(.070).     0.52(.471)      1.14(.288)
DT16F  5        4.76(.030)*      3.64(.058).     2.64(.058).      1.74(.189)      1.65(.201)      0.06(.798)
DT16F  6        3.95(.048)*      9.67(.002)**    2.27(.134)       1.92(.168)      1.27(.261)      0.0(.997)

Note: significance codes: 0.000: ‘****’; 0.001: ‘***’; 0.01: ‘**’; 0.05: ‘*’; 0.1: ‘.’

Table 1: One-way ANOVA using PostTest score as the dependent measure and cluster as the factor; each cell shows F-ratio(p-value)


4.2 Data Preprocessing 

Our data preprocessing involved two steps: 1)
Standardization. To ensure that our state features, measured at different
scales, would contribute equally to the distance functions,
we standardized all features by subtracting the mean and
dividing by the standard deviation; 2) Principal Component
Analysis (PCA), which is widely used for dimensionality
reduction. PCA generates mutually uncorrelated principal
components (PCs) which cover the majority of the variance
information. We selected the PCs with a corresponding
variance larger than 1; thus 6-8 PCs were chosen for the different
training data sets.
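
The standardization step can be sketched as follows. The sketch covers only step 1 (the PCA step is not shown), assumes the population standard deviation, and maps constant features to zero; these choices, the function name and the sample rows are assumptions rather than details given in the text.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Z-score every feature column of a list of equal-length feature vectors."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population standard deviation
    return [
        [(value - mu) / sd if sd > 0 else 0.0
         for value, mu, sd in zip(row, means, stds)]
        for row in rows
    ]


# Two hypothetical features measured on very different scales.
raw = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
for row in standardize(raw):
    print(row)
```

After standardization both columns take the same values, so neither dominates a distance computed over them.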


4.3 Clustering Process 

While most previous clustering research on sequential
trajectories used the entire trajectory, we investigated whether
it was more effective to use only sub-sequential trajectories
rather than the entire trajectories. This was especially
important because we wanted to identify students with
different learning patterns, especially students with
unprofitable learning, as early as possible so that personalized
learning could be offered.


To do so, we iteratively generated our nine training datasets
(three types of granularity across three semesters), using
sub-sequential trajectories from the beginning of the training up
to each of the six levels separately. For example, the
‘Level4-Problem-DT16S’ training dataset was generated by using
problem-by-problem trajectories from the beginning of the
training process up to level 4 in DT16S. We then followed
three steps:


Distance matrix. We explored three types of distance
matrices: DTW, normalized DTW and Euclidean distance.
Euclidean distance was used as the baseline here.


Outlier Detection. Given that many clustering methods
are sensitive to outliers, we applied a filtering approach
to remove them from our training data. More specifically,
for each type of distance matrix, we calculated the average
distance from each student to all others and then obtained the
mean $\mu$ and standard deviation $\sigma$ of all students’ average
distances. We filtered out students whose average distances
were larger than $\mu + 2\sigma$.


Cluster Evaluation. We applied Hierarchical Clustering
to the distance matrices calculated above, and used PostTest
scores to evaluate the effectiveness of the resulting clusters.
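
The outlier filter and the ANOVA-based evaluation of the steps above can be sketched as follows. This is an illustration under stated assumptions: the population standard deviation is used for the filter, only the F-ratio is computed (a p-value would need the F distribution, which is not implemented here), and all names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def filter_outliers(dist, n_sd=2.0):
    """Indices of students whose average distance is at most mu + n_sd * sigma."""
    n = len(dist)
    avg = [sum(dist[i][j] for j in range(n) if j != i) / (n - 1)
           for i in range(n)]
    mu, sigma = mean(avg), pstdev(avg)
    return [i for i in range(n) if avg[i] <= mu + n_sd * sigma]


def f_ratio(groups):
    """One-way ANOVA F statistic for a list of score lists, one per cluster."""
    scores = [s for group in groups for s in group]
    grand_mean = mean(scores)
    k, n = len(groups), len(scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((s - mean(g)) ** 2 for g in groups for s in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy distance matrix: nine nearby items and one far-away item.
points = list(range(9)) + [100]
dist = [[abs(p - q) for q in points] for p in points]
print(filter_outliers(dist))

# Hypothetical PostTest scores for three clusters.
print(f_ratio([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```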


5. RESULT 


As mentioned above, while the assigned condition did not 
seem to be a crucial factor to predict student PostTest scores, 
we explored whether our proposed temporal clustering frame- 
work could do better. 


5.1 Cluster Evaluation 

Dependent Measures: F-ratio(p-value)

DT     #Student  PostTest     Interaction  WrongApp     hintCount    avgstepTime  avgTime      TotalTime
DT15F  155       10.31(.001)  40.54(.000)  21.55(.000)  6.79(.010)   0.01(.919)   17.15(.000)  20(.000)
DT16S  190       13.69(.000)  47.59(.000)  67.47(.000)  99.73(.000)  2.77(.097)   28.76(.001)  36.21(.000)
DT16F  178       3.93(.048)   2.28(.133)   0.16(.691)   5.99(.015)   13.45(.000)  0.31(.58)    0.20(.655)

Dependent Measures: Mean(Standard Deviation)

DT     Cluster  Size  PostTest      Interaction  WrongApp  hintCount  avgstepTime  avgTime      TotalTime
                      (score)       (count)      (count)   (count)    (sec)        (min)        (hour)
DT15F  C1       47    84.84(21.64)  1052(432)    80(47)    21(34)     6.01(1.86)   5.45(2.31)   1.75(0.73)
DT15F  C2       26    76.92(26.02)  1259(662)    110(79)   44(45)     10.88(3.21)  11.47(5.52)  3.76(1.97)
DT15F  C3       55    72.35(28.77)  2021(752)    214(155)  76(61)     8.31(3.00)   12.64(5.14)  4.49(1.76)
DT15F  C4       27    66.58(24.49)  1706(600)    154(101)  26(30)     5.48(1.73)   7.88(3.28)   2.60(1.04)
DT16S  C1       112   91.04(16.53)  1242(519)    104(64)   13(12)     5.89(2.32)   5.98(3.60)   2.06(1.22)
DT16S  C2       41    83.99(23.83)  1483(660)    140(91)   22(16)     9.37(3.73)   10.72(4.33)  3.66(1.42)
DT16S  C3       14    70.98(27.14)  2186(551)    275(170)  39(28)     5.04(1.84)   8.84(4.50)   3.16(1.73)
DT16S  C4       23    78.66(26.08)  2058(994)    278(205)  65(52)     6.81(2.08)   10.91(8.07)  4.05(2.98)
DT16F  C1       40    79.61(20.67)  1216(500)    122(92)   17(21)     8.76(2.53)   9.11(5.43)   3.15(1.94)
DT16F  C2       44    88.21(16.35)  1713(867)    147(98)   16(15)     4.19(0.94)   5.71(2.93)   2.03(1.09)
DT16F  C3       35    78.57(25.87)  2335(887)    276(182)  43(34)     6.26(1.74)   11.25(4.62)  4.09(1.84)
DT16F  C4       59    90.09(13.95)  1440(528)    116(66)   25(28)     5.99(1.48)   7.12(3.30)   2.47(1.19)

Table 2: Results of one-way ANOVA on the dependent measures for the best clustering assignment


Table 1 summarized the clustering results. In Table 1, each row
denoted the clustering results obtained with student interactive
sub-sequential trajectories, varying from the first three
levels up to all six levels. For instance, ‘Level 4’ used
sequential data or summarized data points from the
beginning of the training process up to level 4. Note that we did
not get good clustering results when using only the first two
levels, so their results were omitted from the table. This
was probably because there was a lot of noise in the first
two levels, as some students were still getting used to the
tutor. Each cell in Table 1 denoted the one-way ANOVA result
using PostTest score as the dependent measure and
cluster as the factor, in the format F-ratio(p-value). The
bold numbers (marked with significance codes) showed that
significant differences were found among clusters on PostTest
scores. Each column represented a type of granularity
combined with a distance function: DTW, normalized DTW or
Euclidean. For problem granularity, we only applied the DTW
and normalized DTW approaches because Euclidean distance
cannot be applied to sequential trajectories of different
lengths. For level granularity, we utilized all three distance
functions. Note that when calculating the Euclidean distance,
we first calculated the distance for each level separately and
then summed the per-level distances. For session granularity,
all three distance functions were equivalent in that all reduced
to the Euclidean distance.
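
The level-granularity Euclidean baseline described above, with one distance per level and the distances summed, can be sketched as follows. The function name and the sample vectors are hypothetical, and two levels are used instead of six to keep the example short.

```python
import math


def level_euclidean(a, b):
    """Sum of per-level Euclidean distances between two fixed-length sequences.

    a, b: sequences with one feature vector per level, in the same order
    """
    if len(a) != len(b):
        raise ValueError("level sequences must have the same length")
    return sum(math.dist(u, v) for u, v in zip(a, b))


# Two hypothetical students, each summarized by two level vectors.
student_a = [(0.0, 0.0), (1.0, 1.0)]
student_b = [(3.0, 4.0), (1.0, 1.0)]
print(level_euclidean(student_a, student_b))
```

Unlike DTW, this baseline aligns level k of one student only with level k of the other, which is why it requires sequences of equal length.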


Granularity Comparison. Table 1 showed that among
the three types of granularity, problem granularity was the most
suitable for clustering because significant differences were
found across all three datasets and across all levels of
sub-sequences on PostTest scores when using problem
granularity. This finding was consistent with our hypothesis that
directly clustering students’ moment-to-moment fine-grained
trajectories indeed helps to discover the underlying
characteristics of student learning processes.


Distance Function Comparison. To compare the three
distance functions, we focused only on level granularity,
since it was the only one that involved all three distance
functions. Table 1 showed that both original and normalized
DTW outperformed Euclidean distance, because no significant
differences were found among the clusters using Euclidean
distance. To compare the two types of DTW, we focused on
both problem and level granularity. Table 1 showed that
normalized DTW could induce more robust and consistent
results than DTW. In short, among the three distance
functions, our proposed normalized DTW was the best.


Sub-sequences Comparison. Table 1 showed that
consistently significant results were found for all problem
granularity data sets using normalized DTW and sub-sequential
trajectories up to the first four or five levels.
Interestingly, using the entire sequential data may not be
as effective as using sub-sequences: for the DT15F and DT16S
datasets, no significant difference was found when using
problem granularity on the entire trajectories.


Variable | Definition
PostTest | the student's post-test score
Interaction | number of the student's actions
WrongApp | number of wrong applications of rules
hintCount | number of hints that the student takes
avgstepTime | average time per step
avgTime | average time per problem
TotalTime | time to complete the training process


Table 3: Variables and Definitions 


5.2 Clusters Analysis 

Table 1 showed that consistently significant results were
found when we clustered on problem granularity using
normalized DTW on sub-sequences from the beginning of the
training process through level 4, across all three
semesters’ datasets. Therefore, in the following, we shed
some light on the characteristics of the discovered clusters.


Table 2 showed one-way ANOVA results on seven dependent
measures using clusters as the factor. We bolded p-values
that were less than 0.05. We found significant differences
on all variables except avgstepTime for DT15F and DT16S.
For DT16F, significant differences existed on three
variables: PostTest, hintCount and avgstepTime. To
investigate how much the clusters differed on the selected
variables, we presented the mean and standard deviation for
each pair of cluster and variable in Table 2. We highlighted
the means that were significantly different from the others,
either extremely large or extremely small. We analyzed the
differences among clusters for the three semesters
separately, as follows.


1. DT15F. C1 had the highest PostTest while C4 had the
lowest among the four clusters. C1 also had the lowest
Interaction, WrongApp, hintCount and TotalTime among the
four clusters. Although C2 and C3 had similar PostTest, C2
had dramatically larger Interaction, WrongApp and
hintCount than C3. Furthermore, C3 had the largest values
of Interaction, avgTime and TotalTime.


2. DT16S. C1 had the highest PostTest and the lowest
Interaction; in contrast, C3 had the lowest PostTest
and the highest Interaction among the four clusters.
Although C2 and C4 had close PostTest scores, C4 had higher
WrongApp and hintCount than C2.


3. DT16F. C4 had the highest PostTest, while C3 had the
lowest. Although C2 performed close to C4, C2 had
higher WrongApp than C4. Furthermore, C1 had the lowest
Interaction and the highest avgstepTime, while C3 had
the highest Interaction and WrongApp.


In short, our results showed that the discovered clusters
indeed had distinctive interactive patterns and could
predict students' PostTest better than their assigned conditions.
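
The F-ratio reported by a one-way ANOVA with clusters as the factor can be computed as in the minimal sketch below. It assumes each cluster is given as a list of scores for one dependent measure; the function name is hypothetical, and the p-value (which requires the F distribution) is left out.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F-ratio for a list of numeric groups."""
    k = len(groups)                      # number of clusters
    n = sum(len(g) for g in groups)      # total number of students
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and n > k observations")
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation explained by cluster membership.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Residual variation inside each cluster.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("F-ratio is undefined when within-group variance is 0")
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F-ratio means that the cluster means differ by more than the spread inside the clusters would suggest, which is the pattern the bolded cells in Tables 1 and 2 mark as significant.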


6. CONCLUSIONS & FUTURE WORK 


In this paper, we proposed a temporal clustering framework
to directly cluster student interactive trajectories. In
particular, we explored three distance functions and three
types of granularity. Results showed that normalized DTW is
the most effective function for generating the distance
matrix, and that problem granularity is more effective than
level and session granularity. More importantly, through
statistical analysis of the clusters, we were able to
identify distinctive patterns among clusters during the
learning process, which could benefit personalized learning.
In future work, we will modify the distance matrix by
combining a kernel function with the DTW approach, so that
an effective distance matrix can be generated for sequential
data containing both continuous and discrete features.

ABSTRACT


Characterizing the nature of students’ affective and emotional states
and detecting them is of fundamental importance in online course
platforms. In this paper, we study this problem by using discussion
forum posts derived from large open online courses. We find that
posts identified as encoding confusion are actually manifestations
of different learner affects pertaining to their informational needs,
primarily seeking factual answers. We quantitatively demonstrate
that content-related linguistic features and community-related
features derived from a post serve as reliable detectors of
confusion while widely outperforming currently available algorithms
of confusion detection. We also point out that several prediction
tasks in this domain (e.g., confusion and urgency detection) can be
correlated, and that a model trained for one task can effectively be
used for making predictions on the other task without requiring
labeled examples. Finally, we highlight a very significant problem of
adapting the classifier to unseen courses.


Keywords 


Confusion characterization, discussion forum analysis 


1. INTRODUCTION 


Discussion fora constitute a central feature of learner interaction in
online course platforms, where learners post questions, opinions,
and concerns, which are viewed, rated and answered by fellow
learners and/or teaching staff. In the particular instance of courses
affording only virtual interactions, such as at-scale learning
environments, forum posts constitute rich repositories of students’
affective and emotional states captured in real time. The focus of
this study is on characterizing the nature of students’ affective and
emotional states that are manually identified as confusion in forum
posts, and on developing automatic methods to detect them. Here, as
in [25] and [2], we operationalize confusion as a state in which a
student hits an impasse and is uncertain of how to move forward. As
such, confusion could be attributed to a lack of clarity on the topic
discussed or to technical shortcomings of the learning interface,
among other reasons. Examples of such posts are shown in Table 1.

Table 1: Posts representing confusion and its absence.

I have also problems with the section “Pre course Survey”
I have completed this section several times about 10, I have
the final message “Thanks” but at each new connection appears
in my courseware “pre course Survey (please complete)” Please
help me, what I have to do ? (Confusion)

Interesting! How often we say those things to others without
really understanding what we are saying. That must have been a
powerful experience! Excellent! (No confusion)


The strong connection between learner affect, engagement, and
learning outcomes has long been understood, but studies on their
effect on continued participation in internet-based learning
environments such as MOOCs are only emerging (e.g., [25, 2]). In
addition to constituting supporting evidence for this association,
mechanisms to automatically detect learner affect encoded
via confusion in discussion fora serve the following ends. First,
they inform us about the aspects of a course that are frustrating for
learners and hence need improvement [24, 21, 11]. Second, they
can aid timely and accurate intervention for struggling learners by
providing critical insights into their emotional states [25],
eventually leading to success of this critical course component.


For instance, when a student expresses confusion or misunderstand- 
ing about a concept, the immediacy with which the confusion is ad- 
dressed impacts student satisfaction and course progress. Because 
of this, and the demands of an at-scale learning environment, effi- 
cient and automatic detection of confusion has become more im- 
portant than ever before. With a steady increase in the number of 
courses on online course catalogs, and with limited means to con- 
trol the instructor-to-student ratio in online platforms, the problem 
of detecting confusion as expressed in online fora is timely. Despite 
the critical need, relatively few studies analyze confusion in course 
discussion forum posts [25, 2]. 


While the explicit purpose of discussion fora is to engage the users 
in a way that develops a sense of community and communication 
within large-scale online courses, the posts themselves serve as 
proxy for learner affect and emotions expressed in various forms. 
Detecting this encoded affect from posts is an important challenge 
for natural language processing algorithms. This is because, at the 
outset, a post indicating confusion could be construed to be a ques- 
tion. Since question posts and confusion posts—forms of informa- 
tion seeking behavior—are remarkably similar, one would expect 
that approaches to detect questions (e.g., [7]) ought to be directly 
applicable. However, this is not always the case. Confusion posts
often lack an explicit question, which makes the two problems of
question detection and confusion detection closely related but not
the same. Detecting confusion in a post is thus a non-trivial
problem, partly because, in posts that do contain a question, the
question tends to occur alongside other declarative sentences.
A second difficulty is the use of different question styles (e.g.,
informal styles, in which standard features such as the question
mark are likely to be absent). Hence, simple heuristics based on the
question mark or 5W1H words (who, what, which, where, why, how) are
inadequate.


Additionally, as observed in [18], finding patterns to identify non- 
questions is more challenging than finding patterns in questions 
(since they usually do not share common lexical and/or syntactic 
patterns). This is directly applicable to confusion posts where posts 
not indicative of confusion have diverse intent. 


Prior studies in this direction (e.g., [6, 2, 25]) have led to the use 
of linguistic and structural features available from the discussion 
forum. While similar in spirit to these prior studies, this study sets 
itself apart from them in several ways. First, we identify that
confusion detection is different from simple/complex question
detection. To solve this problem more effectively, we point out that
the community needs a characterization of confusion instead of
treating it as yet another text-classification task. We present an
in-depth analysis of types of “confused posts” using high-quality and
reliable manual annotations (Section 4). Motivated by this analysis,
we then design features to detect confusion automatically in a su- 
pervised framework. We also point out that several prediction tasks 
in this domain (such as confusion and urgency detection) are corre- 
lated, and demonstrate that a model trained for one task can effec- 
tively be utilized for making predictions on the other task without 
requiring labeled examples. Finally, we highlight a very significant 
problem concerning the applicability of such classifiers to unseen 
courses. We summarize our contributions below: 


Characterizing affective states and informational needs: We ob- 
serve that nearly half of the posts encoding confusion and consid- 
ered urgent pertain to users seeking answers to factual questions. 
Aside from indicating an information need, these posts are also 
used to report course-specific issues such as concerns with assign- 
ments or quizzes as well as to report course-related technical issues 
(e.g., unavailability of a lecture video or a peer-assessment grade). 


Efficient confusion detection: We quantitatively demonstrate that 
our use of content-related linguistic features of a post and a set 
of community-related features associated with it serve as reliable 
detectors of confusion while widely outperforming currently avail- 
able algorithms of confusion detection. 


Combined confusion and urgency detection: We show that the 
trained confusion classifier also functions as an efficient urgency 
detector when tested on confusion posts also labeled as ‘Urgent’. 


Scaling the effort to other courses and domains: Based on the 
dataset, we make concrete suggestions to explore domain adapta- 
tion towards building course-generic classifiers. Rather than aim- 
ing for course-independent classifiers, our proposal is to harness the 
utility of available course-specific classifiers for an unseen course, 
based on suitably defined cross-domain similarities. 


By means of a thorough quantitative evaluation of our proposed
features in a supervised machine learning model, we demonstrate
the model's effectiveness as a scalable and efficient approach for
automatic detection of confusion that generalizes well to unseen courses.


2. RELATED PRIOR WORK 


Confusion and its impact on learners: Studies modeling con- 
fusion and exploring its relation to learner affect have found that 
even though students seem to struggle when confused, the situation 
leads them to attempt to resolve barriers to their understanding of 
complex concepts [16, 10, 8]. However, it has also been pointed 
out that remaining confused has a negative effect that leads to stu- 
dent disengagement and eventual dropout, thus making it impera- 
tive that confusion be resolved immediately [15, 25]. This neces- 
sity is more immediate in the context of learning at scale given the 
impersonal and the distant nature of the learning process[14, 19]. 
Thus, detecting learner affect, particularly with respect to
understanding the material, has the potential to contribute to the
design of interventions, which, as shown in prior studies (e.g.,
[9, 22]), can lead to increased learning effectiveness in
computer-based learning environments such as online courses.


Detecting confusion: Focusing on MOOCs, where the only venue 
for learner-instructor interaction is the discussion forum, studies 
are now beginning to explore automated mechanisms to provide 
timely learner support by analyzing forum content. These include
predicting when instructor intervention is needed [5, 6], monitoring
students’ opinions towards the course [20], recommending questions
to users for assisting students seeking answers [23], identifying ac- 
ceptable answers [13], organizing the forum content into aspects or 
topics along with their sentiments to help instructors in promptly 
addressing common issues [17], identifying posts that express con- 
fusion to predict points of eventual student dropout [25], and de- 
tecting posts that express confusion to then map confused posts to 
course video clips as a way to automate interventions [2]. A com- 
mon feature of these approaches to detect confusion is their reliance 
on textual and structural features of the discussion forums to design 
effective algorithms. 


While [25] uses a set of linguistic features to detect confusion, it
disregards the structural features (e.g., the number of times a post
has been read or the number of up-votes) that are found to be
useful in detecting informational need or urgency [6]. In contrast,
[2] uses a set of structural features in combination with a
linguistic feature, and also relies on other dimensions of a post,
such as expression of sentiment and the sense of urgency. This latter
reliance is not realistic given the manual effort of assigning the
labels for sentiment and urgency (needed to design the corresponding
classifiers). Our study shares similarities with these prior studies
in that we rely on discussion forum information, but differs from
them in its use of a novel set of features that encode both
content-related and structural aspects of the forum posts.


We compare the performance of our detection approach to that in 
[2] and show that our approach outperforms current state-of-the- 
art by a wide margin both in-domain and across course domains. 
In addition, differing from prior work, we show that our confusion 
classifier can simultaneously detect urgency, thereby addressing the 
need for immediacy for learning effectiveness. 


3. DATA DESCRIPTION 


The forum posts analyzed in this study are from the Stanford MOOC 
Posts dataset, a corpus composed of 29,604 anonymized learner fo- 
rum posts from eleven Stanford University public online classes 
[1]. The posts are taken from three course domains:
Humanities/Sciences, Medicine, and Education, with about 10,000 posts
in each set.

Table 2: Summary of posts from the three discussion forums

Category | Posts | Not Confused | Confused | Confused & Urgent (%) | No. of sentences (mean, sd)
[Table rows not recoverable from the source.]


A salient feature of the dataset is that each post is available with 
manually assigned labels for six dimensions indicating confusion, 
urgency, question, opinion, answer, and sentiment. We encourage 
the readers to refer to [1] for more details. In our study, we only 
consider the dimensions of Confusion and Urgency: 


Confusion - the extent to which the post expresses confusion,
on a scale of 1 (expert knowledge) to 7 (extreme confusion);


Urgency - the extent to which the post is interpreted to be urgent
and to require an instructor's response, with 1 denoting
‘not urgent at all’ and 7 denoting ‘extremely urgent’.


We divide the posts into two groups, “confused” and “not confused”,
based on their gold Confusion scores. A score above 4 is
considered a Confused post, whereas a score below 4 is regarded
as a Not Confused one (we disregard posts with a score of exactly 4
in the analyses). Likewise, an Urgency score above 4 is regarded as
an Urgent post, whereas a score of 4 or below is regarded as a
non-Urgent post. A summary of the data set is provided in Table 2.
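
The thresholding rule above can be written down directly. The following is a minimal sketch, assuming the gold scores are available as plain numbers; the function names and label strings are illustrative and not taken from the dataset.

```python
def confusion_label(score):
    """Map a gold Confusion score (1-7) to a binary label.

    Returns None for a score of exactly 4, since such posts are
    excluded from the analyses.
    """
    if score > 4:
        return "confused"
    if score < 4:
        return "not confused"
    return None


def urgency_label(score):
    """Map a gold Urgency score (1-7) to a binary label."""
    return "urgent" if score > 4 else "non-urgent"
```

Note the asymmetry between the two rules: a Confusion score of 4 is dropped, while an Urgency score of 4 is kept and counted as non-urgent.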


4. CHARACTERIZING CONFUSION 


To understand how confusion is expressed in forum posts, two of 
the authors independently coded a random sample of 200 posts 
from the entire data set for the following 6 types: 


1. Factual, if the post seeks clarification of a factual aspect of 
the course material, as in the post, “Does this mean logis- 
tic regression always gives adjusted ratios and the manually 
computed ratios are unadjusted?” 

2. Course-specific, if the user seeks a course-specific clarifi- 
cation, such as “Dear Staff, Can you give atleast 2 attempts 
for each quiz. Giving only one attempt is making us loose 
interest in the course. Kindly consider.”

3. Course-technical, if the user seeks clarification on technical 
aspects of the course. For example, “I am trying to download 
5.R.RData, but I cannot open it, can please let me know how 
I can open this file. With kind regards,” 

4. Recommendation, if the user is seeking a recommendation. 
For instance, consider the following post. “another question 
would you use this form throughout the whole essay? or 
would you shorten it after using the full phrase?” 

5. Frustration, where the user expresses frustration, as in, “I 
had the same issue. Am I bad at finding the check button and 
bad at math???” 

6. Other, for posts that belong to none of the above 5 types. 


The inter-rater reliability, κ, was 0.81. Based on the instances
where both coders agreed, we characterize the type of posts. True
to the fact that the discussion forum is an avenue for learners to seek
learning support from fellow learners, the most popular post type is
Factual (54% of the annotated posts), where learners seek to clarify
their misunderstandings of concepts presented in the course. This
post type is followed by Course-specific (27%) and
Course-technical (12%). The remaining posts were categorized as
Recommendation (3%), Frustration (2%) and Other (2%).
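
As a reminder of how an inter-rater agreement statistic such as κ is obtained, the sketch below computes Cohen's kappa for two coders. It assumes each coder's annotations are given as a list of category labels in the same post order; the function name is hypothetical and the data in the usage note are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single, identical category
    return (observed - expected) / (1.0 - expected)
```

For instance, with the toy annotations `["Factual", "Factual", "Other", "Other"]` and `["Factual", "Factual", "Other", "Factual"]`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5.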


Overall, these observations confirm that posts indicative of confu- 
sion need to be addressed in a timely manner; even though some 
of them may not be explicit questions, they echo the information 
seeking nature and the uncertainty encoded in posts that are ex- 
plicit questions. Additionally, we hypothesize that the inherent dif- 
ference in the nature of affective states encoded as confusion could 
be responsible for the inconclusive nature of the effect of confusion 
on learning outcomes (e.g., confusion positively impacting learning
in [10] and negatively impacting outcomes in [25]).


5. DETECTING CONFUSION 


Our next focus is on building a confusion detector that will allow 
for automatic identification of confusing posts to facilitate immedi- 
ate response thereby enhancing the learning experience and reduc- 
ing learner frustration. Towards this end, the confusion-detection 
features can be grouped into two categories: content-related and 
community-related features. 


Content-related features: These features analyze the textual con- 
tent of the post: 


1. Automated readability index (ARI): Readability indices are 
designed to measure how understandable a piece of text is. 
We hypothesize that the posts encoding confusion, owing 
to their information seeking nature as well as owing to the 
tendency of learners to post verbatim course content, have 
higher readability indices (i.e., are more difficult to read) than
those posts that do not encode confusion.

2. Post length in words.

3. Unigrams: These binary features encode whether a word oc- 
curred in the post or not. 

4. Topicality (LDA): These features use supervised Latent Dirich- 
let Allocation (LDA) [4] to generate the LDA labels as fea- 
tures. Towards this, we first perform a preprocessing step in- 
volving stop-word removal (including numbers and punctua- 
tion); stemming; and removing high-frequency (top 1%) and 
low-frequency words (occurring fewer than 5 times). Then 
a supervised LDA (sLDA) model is obtained with the con- 
fusion labels. Here we use the confusion labels for each 
post to obtain two sets of LDA words (associated with pres- 
ence/absence of confusion). This model predicts a label (con- 
fusion or not) based on the words in the post that occur in the 
respective LDA set. 

5. Question mark: Since confusion is often expressed via ques- 
tions, this feature checks for presence of a question mark. 
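
Three of the content-related features listed above (ARI, post length, and question mark) are simple enough to sketch. The ARI formula used here is the standard one; the tokenization and sentence splitting are simplifying assumptions, and the function and key names are hypothetical rather than the authors' implementation.

```python
import re


def automated_readability_index(text):
    """Standard ARI: 4.71*(chars/words) + 0.5*(words/sentences) - 21.43."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Assumption: sentences end at runs of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    chars = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    return (4.71 * chars / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)


def content_features(text):
    """Collect the ARI, length, and question-mark features for one post."""
    return {
        "ari": automated_readability_index(text),
        "length_words": len(text.split()),
        "has_question_mark": "?" in text,
    }
```

Short, simply worded posts can receive negative ARI values; only the relative ordering of posts matters when the index is used as a feature.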


Community-related features: A second set of predictors of whether
a post encodes confusion is obtained by observing how the
community of learners reacts to the post. In particular, a post that
is of general interest to learners (such as one that seeks a factual
clarification, or resolution of a course-related technical
problem) would be read by several viewers, thus leading to a
relatively higher number of reads. Likewise, posts encoding confusion
are considered important, resulting in more up-votes. Accordingly,
our set of features includes the number of (i) reads and (ii) up-votes
of the post.


Table 3: Performance of our approach and the two baselines. ‘NR’ stands for results that were not reported in the respective paper.

Course | Model | Accuracy | Precision | Recall | F-measure | Cohen’s Kappa
Humanities | Our Model | 84.38 | 90.38 | 77.16 | 83.14 | 0.69
Humanities | Unigrams Model [3] | 71.99 | 71.00 | 82.21 | 75.28 | 0.44
Humanities | YouEDU [2] | NR | 77.80 | 64.20 | 70.00 | 0.62
Education | Our Model | 80.04 | 79.44 | 81.02 | 80.00 | 0.60
Education | Unigrams Model [3] | 82.03 | 78.76 | 87.81 | 82.96 | 0.64
Education | YouEDU [2] | NR | NR | NR | 38.30 | 0.36
Medicine | Our Model | 83.75 | 86.67 | 80.14 | 83.16 | 0.67
Medicine | Unigrams Model [3] | 70.39 | 72.82 | 65.33 | 68.69 | 0.41
Medicine | YouEDU [2] | NR | 69.90 | 58.90 | 62.70 | 0.56


We cast the task of confusion detection as one of binary classi- 
fication, where posts expressing confusion constitute the positive 
class. For the purpose of this study we do not use the confusion- 
types identified in the characterization. We trained an Elastic-net 
model, which is a regularization approach that uses a mixture of L1
and L2 penalties to perform variable selection [26].
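
To make the mixture of penalties concrete, the sketch below evaluates an elastic-net penalty for a weight vector. The parameterization (an overall strength `lam` and a mixing weight `alpha`, with a factor of one half on the squared L2 term) is one common convention and is an assumption here; the paper does not state which form or which hyperparameter values were used.

```python
def elastic_net_penalty(weights, lam=1.0, alpha=0.5):
    """Elastic-net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * L2^2).

    alpha = 1 gives a pure L1 (lasso) penalty, which drives some
    weights to exactly zero and thereby performs variable selection;
    alpha = 0 gives a pure squared-L2 (ridge) penalty.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)
```

During training, this quantity is added to the classification loss, so that features with little predictive value are shrunk or removed.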


6. EXPERIMENTS 


Datasets: From Table 2 we can see that, for the majority of the
courses, the data is biased towards the negative (not-confusion)
class. This makes learning difficult, especially for the positive
(confusion) class. To alleviate this problem, for each course, we
randomly down-sample the negative class such that the two classes are
balanced. Additionally, the forum posts from ‘Education’ contain very
few (640) confusion posts. This resulted in a very small resampled
dataset for this course (compared to Humanities and Medicine) after
down-sampling the negative class. Noting that this dataset was prone
to over-fitting because it had very few posts relative to the number
of features, we up-sampled the positive class to twice its original
size before down-sampling the negative class as before.
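
The resampling procedure described above can be sketched as follows. The sketch assumes that up-sampling is done by duplicating every positive post, which is one possible reading; the paper does not specify the mechanism, and the function name, parameters, and seed are illustrative.

```python
import random


def balance_classes(positives, negatives, upsample_factor=1, seed=0):
    """Balance a binary dataset by down-sampling the negative class.

    upsample_factor=2 mimics doubling the positive class before
    down-sampling (duplication is an assumption, see the lead-in).
    """
    rng = random.Random(seed)
    pos = list(positives) * upsample_factor
    neg_pool = list(negatives)
    # Draw, without replacement, as many negatives as there are positives.
    k = min(len(pos), len(neg_pool))
    neg = rng.sample(neg_pool, k)
    return pos, neg
```

With `upsample_factor=1` this reproduces the plain down-sampling applied to Humanities and Medicine; with `upsample_factor=2` it corresponds to the treatment described for Education.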


We also tokenized the content of the posts; removed stopwords 
(175 unique words); stemmed [12]; and removed infrequent words 
(with count less than 5). The final vocabulary lists for these courses 
contained about 2400, 1400, and 1750 words respectively. 


Evaluation Measure: From the perspective of helping students, 
the positive (confusion) class, indicative of learner affect, is more 
important than the negative class. An ideal classifier would, there- 
fore, identify all confusion posts bringing them to the instructor’s 
attention (high recall for the positive class). Additionally, a high 
precision for the positive class is also important so that the instruc- 
tor’s efforts are not wasted in analyzing false-positives. Therefore, 
it seems natural to evaluate models using the F-measure of the pos- 
itive class (in-line with related prior work). For the sake of com- 
pleteness, we also report accuracy and Cohen’s Kappa. 
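
The positive-class measures mentioned above follow directly from the confusion matrix. This is a minimal sketch assuming gold and predicted labels are given as parallel lists, with the confusion class encoded as 1; the function name and encoding are illustrative.

```python
def binary_metrics(gold, pred, positive=1):
    """Accuracy plus precision, recall and F-measure of the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}
```

High recall means few confused posts are missed, while high precision means the instructor spends little time on false positives; the F-measure balances the two.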


6.1 Confusion Detection 

Table 3 compares 10-fold cross-validation results of our model with
two prominent baselines: (i) Unlike our model, our first baseline [2]
uses manual annotation for dimensions such as Opinion and Question
(apart from ground-truth confusion labels for training). We include
its performance as reported in the original paper. (ii) The second
baseline [3] uses only Unigram features. We replicated this baseline
in our experiments. A random baseline would get a score of 50%;
we omit this result from the tables for clarity.


We can see that, for Humanities and Medicine, our model performs 
significantly better than the baselines. For instance, for the Hu- 
manities course, our model achieves 10.4% and 18.8% relative im- 
provements in F-measure over the two baselines. Similarly, on the 
Medicine course, our model achieves 21.1% and 32.3% relative 
improvements in F-measure. Our model’s Cohen’s Kappa (and ac- 
curacy when reported) are also better than the baselines. This in- 
dicates the utility of our features in not only learning the positive 
class, but also performing well on the overall classification task. 
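
The relative improvements quoted above are ratios of F-measures rather than differences in percentage points. A minimal sketch of the arithmetic, using F-measure values read from Table 3 (the function name is illustrative):

```python
def relative_improvement(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Humanities: our model (83.14) vs. YouEDU [2] (70.00).
humanities_vs_youedu = round(relative_improvement(83.14, 70.00), 1)
# Medicine: our model (83.16) vs. the unigrams model [3] (68.69).
medicine_vs_unigrams = round(relative_improvement(83.16, 68.69), 1)
```

These two evaluate to 18.8 and 21.1 respectively, matching the corresponding figures reported in the text.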


For the Education course, our model outperforms the YouEDU[2] 
model significantly. Our model achieves an F-measure of 80.0% as 
opposed to only 38.3% by the YouEDU model. We would like to 
remind the reader that the data for the Education course was partic- 
ularly skewed towards the negative class (not-confusion) with only 
6.5% of the posts belonging to the positive class (confusion). This 
stark difference in the performance of the two models emphasizes the
need for models that pay particular attention to the minority
class, which in this case is more important than the majority class.


Interestingly, for this course, the performance of our model is com- 
parable to the unigrams model [3], with the latter performing slightly 
better. Both the models use the same dataset and so neither suffers 
from the rare-class problem. The seemingly disadvantageous na- 
ture of our features for this course is not consistent with the results 
obtained for the other two courses, and requires further investiga- 
tion. However, in general, the features proposed in our approach 
provide a considerable boost in performance. 


6.2 Effect of Degree of Confusion 


As mentioned in the data description, the dimension of Confusion 
was annotated on a scale of 1-7 (denoting the degree of confusion), 
which could be potentially construed to correspond to a scale of 
affective states. While we had conflated all the positive confusion
levels (resp. negative levels) for the purpose of detection, here we
evaluated the ability of our detector to capture the degree of
confusion. We examined the performance (here, accuracy) at every
Confusion degree and report the results in Table 5. We observe that
the accuracy increases with confusion level, suggesting the
classifier’s suitability for real applications (e.g., potentially
informative to instructional designers).
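
A per-level breakdown of this kind can be computed by grouping posts on their gold Confusion score. The sketch below assumes the classifier output is a boolean per post (True meaning "confused"); the function name and data layout are assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_level(gold_scores, predictions):
    """Accuracy of binary confusion predictions at each gold score level.

    Posts scored exactly 4 are skipped, as in the rest of the analysis.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for score, predicted_confused in zip(gold_scores, predictions):
        if score == 4:
            continue
        totals[score] += 1
        # A prediction is correct when it agrees with the score > 4 rule.
        hits[score] += int(predicted_confused == (score > 4))
    return {level: hits[level] / totals[level] for level in sorted(totals)}
```

Each entry of the returned dictionary corresponds to one cell of a table such as Table 5, with `totals[level]` giving the number of instances at that level.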


6.3 Feature ablation analysis 

Table 4 compares the predictive importance of our various features 
by removing them one at a time. For convenience, the first row for 
each course depicts the performance with the full feature set (same 
as Table 3). From the table, ‘Unigram’ and ‘Question-mark’ seem 
to be the most valuable. For instance, the model for Education re- 
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Table 4: Feature ablation. For each course, the top row corresponds to the complete feature set. The subsequent rows represent 
performance with one of the features removed. Removing any feature (except ‘LDA’) decreases performance, indicating its utility. 


Removed Feature Cohen's Kappa 
=i None YETI | TC 

0.54 

0.55 

). 


71.24 82.96 
77.40 83.06 
40 52.9 


; Jumber of Reads $4.16 $9.86 
Community-related 849A 89 86 
AR 34. 59.66 


Humanities Post Length 


Unigrams 
LDA 
Question Mark 


Content-related 


= fone [30.07] 


mberorR 
Community-related 
Score 
A iQ 


Post Length 
Unigrams 
LDA 

Question Mark 


= one 


: Number o 
Community-related f 
Score 


Post Length 
Unigrams 
LDA 

Question Mark 


Education 


Content-related 


Medicine 


Content-related 


Table 5: Accuracy of the model in detecting Confusion at dif- 
ferent levels. Numbers in () show number of instances. Per- 
formance improves with increasing scores. Confusion at levels 
higher than 5.5 did not have sufficient instances. 


Course | 45 7 5 | 55 _| 


4.5 5 5.5 
Education |0.76 (521)| 0.80 (93) | 0.87 (24) 
Humanities |0.69 (463)|0.79 (553)|0.79 (190) 
Medicine |0.71 (641)|0.86 (762)} 0.90(154) 


lies heavily on the Unigram features (removing them decreases 
the F-measure from 80% to 64.7%). Removing any of the other 
features, such as ‘Number of reads’ or ‘Post Length’, also hurts 
model performance, albeit to a lesser degree. Experiments reveal 
that including LDA as a feature hurts the model’s performance 
more than it helps. Overall, we can conclude that removing most of our 
features reduces the performance of the model to various degrees, 
indicating their utility. 
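
The leave-one-feature-out procedure described above can be sketched as follows. This is a minimal illustration and not the code used in the experiments: `ablation_deltas` and `toy_evaluate` are hypothetical names, and the weights in the toy scorer are arbitrary values chosen only to show the mechanics, not results from the paper.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def ablation_deltas(
    features: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Return, for each feature, the score change caused by removing it.

    `evaluate` is a caller-supplied (hypothetical) function that trains and
    scores a model on the given feature subset, e.g. by F-measure.
    A negative delta means the removed feature was useful.
    """
    full = frozenset(features)
    baseline = evaluate(full)
    return {name: evaluate(full - {name}) - baseline for name in sorted(full)}


# Toy scorer with arbitrary weights, only to show the mechanics.
TOY_WEIGHTS = {"unigrams": 15.0, "question_mark": 5.0, "post_length": 1.0, "lda": -0.5}


def toy_evaluate(subset: FrozenSet[str]) -> float:
    return 60.0 + sum(TOY_WEIGHTS[name] for name in subset)


deltas = ablation_deltas(TOY_WEIGHTS, toy_evaluate)
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:>14}: {delta:+.1f}")
```

In a real run, `evaluate` would wrap the full train-and-test cycle, so the cost of the ablation grows linearly with the number of features.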


6.4 Testing on Unseen Courses 

Our supervised model requires labeled training data. However, 
considering the short duration of most online courses, manually 
annotating an ongoing course is not only expensive but 
also infeasible due to time and privacy constraints. Hence, domain 
independence of such classifiers is extremely desirable. In our next 
experiment, we test a given model on an unseen course in order 
to estimate the domain independence of existing methods. Table 6 
shows the results of this experiment. The last column of the table 
shows the change in the model’s performance when tested on a course 
not seen during training. We can see that the model performance 
always decreases when it is tested on a new course. However, the 
decrease can be expected to depend on the difference in the 
class-conditional distributions of the train and the test sets. From this 
perspective, one could argue that the posts from Humanities and 
Medicine are more similar to each other than to the posts from 
Education, as far as this task is concerned. For instance, when 

a model trained on data from Humanities is tested on data from 
Medicine, and vice versa, the decrease in F-measure is only about 
4 points. On the other hand, the model suffers a much greater 
decrease in performance when it is trained on data from Medicine (or 
Humanities) and is tested on data from Education, and vice versa. 


This result indicates that domain-adaptation methods that aim to 
build course-independent classifiers should not blindly aim for 
classifiers that perform well on all courses. Instead, a more 
opportunistic alternative would be based on assessing the similarity between 
the data from the source (training) and the target (testing) courses. 
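
One way such a source-target similarity assessment might look is sketched below. The text does not specify a measure; the Jensen-Shannon divergence between unigram distributions used here is an assumption made for illustration, as are the whitespace tokenization, the function names, and the made-up example posts.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def unigram_distribution(posts: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each whitespace-separated, lower-cased token."""
    counts = Counter(token for post in posts for token in post.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: n / total for token, n in counts.items()}


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical distributions,
    1 for distributions with no shared tokens."""

    def kl_to_mixture(a: Dict[str, float], b: Dict[str, float]) -> float:
        total = 0.0
        for token, pa in a.items():
            mixture = 0.5 * (pa + b.get(token, 0.0))
            total += pa * math.log2(pa / mixture)
        return total

    return 0.5 * kl_to_mixture(p, q) + 0.5 * kl_to_mixture(q, p)


# Made-up posts standing in for a source and a target course.
source = unigram_distribution(["I am confused about the quiz deadline"])
target = unigram_distribution(["I am confused about the essay rubric"])
print(round(js_divergence(source, target), 3))
```

A lower divergence would suggest that a classifier trained on the source course is a more promising candidate for the target course; whether unigram divergence tracks the class-conditional differences discussed above is untested here.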


6.5 Urgency Prediction 

In Table 2 we can see that there is a high correlation between the 
‘Confused’ and ‘Urgent’ labelings. For instance, 86.4% of the 
posts from Humanities labeled as ‘Confused’ are also labeled as 
‘Urgent’. Therefore, it is of interest to investigate how 
well a model trained for detecting confusion performs on the 
task of detecting urgency. Table 7 shows the results of this 
experiment. For this table, we train our model using ground-truth 
Confusion labels and use the trained model to make predictions on the 
test instances. We then judge the model’s performance by comparing 
the predicted positive/negative class with the ground-truth 
urgent/not-urgent class. Note that we use urgent/not-urgent labels only 
during evaluation and not training. As before, we are primarily 
interested in the F-measure of the positive (urgent) class. From the table 
we can see that we achieve a reasonably high F-measure, especially 
for Humanities (75.78%) and Medicine (80.68%). This suggests 
that for the two related tasks, classifiers trained for one task could 
be used for the other task with little modification. 
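
The evaluation step described above, scoring confusion predictions against urgency labels, can be sketched as follows. This is an illustrative sketch only; `positive_class_prf` is a hypothetical helper and the example labels are made up.

```python
from typing import Sequence, Tuple


def positive_class_prf(
    predicted: Sequence[bool], gold: Sequence[bool]
) -> Tuple[float, float, float]:
    """Precision, recall and F-measure of the positive class.

    `predicted` holds the confusion classifier's outputs; `gold` holds the
    ground-truth urgent/not-urgent labels, used only for evaluation.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Made-up example: confusion predictions vs. urgency ground truth.
confusion_pred = [True, True, False, True, False]
urgent_gold = [True, False, False, True, True]
print(positive_class_prf(confusion_pred, urgent_gold))
```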


7. FUTURE DIRECTIONS 


Table 6: Model performance decreases when tested on unseen courses. Performance drops indicate a need for more aggressive domain-adaptation efforts on diverse pairs (like Education-Humanities), as compared to similar ones (Humanities-Medicine). 

[Table body not recoverable from the source. Rows: train/test course pairs over Education, Humanities, and Medicine. Last column: Change in F-measure.]


Table 7: Model trained for detecting confusion performs well 
on the Urgency prediction task without using urgency labels. 

Humanities  80.50  72.07  80.59 | 75.78  0.60
Medicine    83.02  76.57  85.54 | 80.68  0.66
Education   61.95  30.13  88.15 | 44.10  0.26


We have presented a detailed analysis of posts indicative of 
confusion from a collection of discussion forum posts from learners on 
online courses spanning 3 domains. Our detailed manual analysis 
of the types of confusion posts suggests that subsequent 
explorations could consider more specific models involving dedicated 
components for each of the confusion types. 


Future work could also focus on supplementing our results with 
qualitative analyses, e.g. via interviews of learners, to explore spe- 
cific findings in greater depth. Another related direction for future 
exploration is the inclusion of clickstream information in the anal- 
ysis to afford a broader view of learner-content interactions in the 
presence of confusion. 
ABSTRACT 


Virtual internships are online simulations of professional practice 
where students play the role of interns at a fictional company. 
During virtual internships, participants complete activities and 
then submit write-ups in the form of short-answer digital 
notebook entries. Prior work used classifiers trained on participant 
data to automatically assess notebook entries from these learning 
environments. However, when teachers create new internships 
using available authoring tools, no such data exists. We evaluate a 
method for generating classifiers using specifications provided by 
teachers during their authoring process instead of participant data. 
Our models rely on Latent Semantic Analysis based and Neural 
Network based semantic similarity approaches in which notebook 
entries are compared to ideal, expert generated responses. We also 
investigated a Regular Expression based model. Experiments 
with the proposed models on unseen data showed high precision 
and recall values for some classifiers using a similarity based 
approach. Regular Expression based classifiers performed better 
where the other two approaches did not, suggesting that these 
approaches may complement one another in future work. 


Keywords 


Automated assessment, text classification, LSA, neural network, 
semantic similarity, regular expressions 


1. INTRODUCTION 


Recently, authoring tools have been developed that let teachers 
customize and create new versions of digital learning 
environments such as intelligent tutoring systems and simulations 
[15]. However, if these environments use integrated automated 
systems, such as classifiers, customization can be problematic: a 
new environment invalidates previous automated systems and 
participant data does not yet exist to train new ones. Therefore, 
teachers who author these learning environments must implement 
them, at least initially, without a key component of the 
technology. 


For example, virtual internships are online simulations of 
professional practice where participants play the role of interns at 
a fictional company [14]. During virtual internships, participants 
complete activities and submit work in the form of digital 
notebook entries. Typically, these are short answer responses 
ranging from a few sentences to a paragraph in length. Prior work 
has investigated automated assessment of notebook entries by 
training classifiers on participant data [10]. However, since the 
development of the Virtual Internship Authoring Tool [18], 
teachers can now customize activities and their notebook 
requirements. Thus, previously developed classifiers may no 
longer be valid and, initially, participant data is not available to 
use for model training. 


In this paper, we present and test a method that addresses this 
issue by generating classifiers from specifications that teachers 
provide during the authoring process rather than waiting to 
generate them from participant data. Ultimately, these classifiers 
will be integrated into a fully automated assessment system that 
will score participant notebook entries. In this study, however, we 
only report on the development of classifiers for determining 
whether teacher defined requirements are present or absent in an 
entry, not classifiers that assign a final assessment. 


2. BACKGROUND 


Several automated essay scoring systems [3, 8, 16] have been 
developed to tackle the challenges of costs, reliability, generality 
and scalability while assessing open-ended essays. Previous 
research on automated essay scoring focused on the 
argumentative power of an entire essay, while in our case, the 
student generated content is typically short text the length of a 
sentence or paragraph. Also, the focus of our assessment is to 
classify the content based on the presence or absence of semantic 
content defined by teachers during their authoring process. This 
means that style and higher-level constructs, such as rhetorical 
structure, are less important in our task compared to essay scoring 
and that factors that focus more on content measures are more 
important. Therefore, we limit our work to a semantic similarity 
approach and Regular Expression (RegEx) matching approach to 
identify the presence of targeted semantic content in participant 
generated text. 


Various text similarity measures have been used since 
the very early years of information retrieval. One of the simplest 
approaches is to use the lexical overlap between the texts; however, 
this approach does not consider the semantic relation between the 
words. Salton & Lesk [13] used a term frequency based vector 
model for document similarity. Such a model fails when two texts 
with the same meaning have few overlapping words. Other 
approaches use a knowledge base such as WordNet to find 
semantically similar words in two texts [4, 9]. However, these 
approaches face the challenge of word sense disambiguation. Other 
approaches use LSA or LDA methods that rely on a large corpus 
and do not face the word sense disambiguation challenge [11]. 
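
The weakness of the two simplest measures can be illustrated with a short sketch. The choice of Jaccard overlap as the lexical-overlap measure, the whitespace tokenization, and the function names are assumptions made for this illustration; the example sentences are made up.

```python
import math
from collections import Counter


def lexical_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def tf_cosine(a: str, b: str) -> float:
    """Cosine similarity between raw term-frequency vectors."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * tb[token] for token, count in ta.items())
    norm_a = math.sqrt(sum(v * v for v in ta.values()))
    norm_b = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two texts with similar meaning but almost no shared words.
text_a = "the kids are happy"
text_b = "the children feel joyful"
print(lexical_overlap(text_a, text_b), tf_cosine(text_a, text_b))
```

Both scores stay low for this pair because only the word "the" is shared, which is the failure case that knowledge-based and corpus-based methods try to address.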


Rus et al. [11] collected a large corpus of student-generated 
paraphrases and analyzed them along several dozen linguistic 
dimensions, ranging from cohesion to lexical diversity, obtained 
from Coh-Metrix [5]. They used the most significant indices to 
build a prediction model that can identify true and false 
paraphrases and also several categories of paraphrase types. Our 
work differs from theirs in two ways. First, our classifier 
model does not rely on participant generated content (we develop 
classifiers from teachers’ specifications of content before any 
participant response is available). Second, our paraphrase 
detection model measures the semantic relation between texts 
without depending on linguistic features such as content word 
counts. 


Our LSA based similarity method relies on the combination of 
the constituent words of a phrase. Hence, the similarity score will be 
biased towards phrases having common words. In contrast, the 
Neural Network (NN) based semantic similarity method proposed 
in [7, 17], which we also explored, projects the phrase pairs into a 
common low dimensional space; hence, the similarity score 
obtained will be more consistent irrespective of the presence of 
common words in the phrases. 


Our work closely relies on previous works [2, 4, 9] where the 
authors proposed methods to measure the semantic similarity 
between texts. The authors in [2] and [4] used knowledge bases 
such as WordNet while the authors in [9] used word to word 
similarity and vectorial representation of words derived using 
Latent Semantic Analysis (LSA) to compute the semantic 
similarity of two given texts. In addition to these methods, our 
work uses phrase vectors generated using 
Neural Network based models [7, 17]. 


Our work is also partially related to the work by Cai et al. [1], 
who proposed methods to evaluate student answers in an 
intelligent tutoring system. They used LSA and RegEx to assess 
student answers. Their work showed that carefully created 
RegEx had high correlation with human raters’ scores. They also 
noted that the correlation increased when the expected answers 
created by experts were combined with previous students’ 
answers to assess new student answers. 


3. METHODS 


We developed three different types of classifier models and 
evaluated their performances separately. 


To generate our classifiers, we worked with data from one teacher 
as she authored an activity in the virtual internship, Land Science. 
In Land Science, participants work to design a city zoning plan 
that balances the demands of stakeholders who advocate for 
indicators of community health. In the activity that this teacher 
customized, participants describe their proposed zoning changes 
in a notebook entry. In the first step of our method, the teacher 
defines assessment criteria for an entry in terms of core concepts, 
or the key semantic content they want to be present or absent in an 
entry. For this entry, the teacher defined five core concepts (see 
Table 1). Next, she constructed six example entries and identified 
the chunks of text in each example that expressed each concept. In 
addition, she provided lists of keywords for each core concept that 
she expected to be present in participant notebook entries. 


Afterward, we developed various classifiers for each core concept 
based on the teacher provided items: sample responses, core 
concepts, and concept keywords. In this paper, we report three 
such classifier types: the LSA based semantic similarity threshold 
classifier, the NN based semantic similarity classifier, and the 
RegEx based classifier. 


In both the LSA based and NN based classifiers, we use a sliding 
window to search for the most similar chunk in an intern’s 
notebook entry. That is, for each teacher-defined chunk, we slide 
a window of equal size over the student entry. For each such 
participant-chunk identified by the sliding window over the 
student’s notebook entry, we calculate the semantic similarity of 
the text within the window to the teacher-defined chunk. After the 
similarity of all windows to a teacher-chunk has been calculated, 
we assign the highest value as the similarity score for a given core 
concept. For LSA based classifiers, we calculated the similarity 
score using SEMILAR [12]. For the NN based classifier, we 
calculated the similarity score using the Sent2Vec¹ tool. Since both 
tools are capable of taking phrases or sentences as input, we 
give the chunks as input phrases; hence, in the remaining sections, 
we refer to these chunks as phrases. 


If the highest similarity score is high enough, i.e., higher than a 
threshold, we decide the target core concept is present in the 
student response. Otherwise, we infer the student response does 
not include the core concept. That is, we developed a semantic 
similarity based classifier for assessing students’ responses. 
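
The sliding-window search and threshold decision described above can be sketched as follows. This is a minimal sketch under stated assumptions: `token_overlap` is a simple stand-in for the SEMILAR or Sent2Vec similarity scores (it is not either tool), taking the maximum over several teacher chunks is an assumption, and the threshold of 0.5 and the example texts are arbitrary.

```python
from typing import Callable, Sequence


def token_overlap(a: str, b: str) -> float:
    """Stand-in similarity (Jaccard over tokens); NOT SEMILAR or Sent2Vec."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa and sb else 0.0


def best_window_similarity(
    entry: str, chunk: str, similarity: Callable[[str, str], float]
) -> float:
    """Slide a window the size of `chunk` over `entry`; return the top score."""
    entry_tokens = entry.split()
    size = len(chunk.split())
    if not entry_tokens or size == 0:
        return 0.0
    if len(entry_tokens) <= size:
        windows = [entry]
    else:
        windows = [
            " ".join(entry_tokens[i : i + size])
            for i in range(len(entry_tokens) - size + 1)
        ]
    return max(similarity(window, chunk) for window in windows)


def concept_present(
    entry: str,
    teacher_chunks: Sequence[str],
    threshold: float,
    similarity: Callable[[str, str], float] = token_overlap,
) -> bool:
    """Decide presence of a core concept from the highest window score."""
    score = max(
        (best_window_similarity(entry, c, similarity) for c in teacher_chunks),
        default=0.0,
    )
    return score > threshold


chunks = ["residential zone near the river"]
print(concept_present("I changed the residential zone near the river to open space", chunks, 0.5))
print(concept_present("We met the stakeholders today", chunks, 0.5))
```

Passing the similarity function as a parameter keeps the window search independent of whichever similarity tool supplies the scores.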


To choose a threshold for the similarity based classifiers, 
we calculated the similarity scores between 
the chunks of each of the core concepts tagged by the teacher, for 
both LSA based and NN based methods. See Section 4.2 for 
details. 


To test the validity of our approach, we developed classifiers for 
each target concept and then tested them using 199 participant 
entries coded by humans for the presence or absence of each core 
concept. 


Because our initial thresholds were created without the aid of 
participant data, we expected that better thresholds would exist. 
We therefore sought to compare the performance of our classifiers 
using two different thresholds, the derived thresholds above and 
ideal thresholds (described in more detail below). To calculate the 
ideal threshold for each classifier we varied the semantic 
similarity thresholds from zero to one and obtained precision and 
recall measures for each threshold using participant data. 


For the RegEx based classifiers, we used the teacher provided 
keywords, which were generated without using participant data, to 
create regular expression lists for each core concept. We infer that 
the target core concept is present in a given entry as long as any of 
its associated keywords are present, as determined by regular 
expression matching. Therefore, in contrast to the LSA and NN 
models, a threshold is not required for the RegEx classifiers. 
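
The RegEx decision rule described above, where a concept counts as present if any of its expressions matches, can be sketched as follows. The patterns shown are hypothetical examples invented for illustration; the teacher's actual keyword lists are not given in the text.

```python
import re
from typing import Iterable, List, Pattern


def compile_patterns(expressions: Iterable[str]) -> List[Pattern[str]]:
    """Compile a concept's regular expression list, ignoring case."""
    return [re.compile(expr, re.IGNORECASE) for expr in expressions]


def regex_concept_present(entry: str, patterns: Iterable[Pattern[str]]) -> bool:
    """A concept is present as soon as any one expression matches."""
    return any(pattern.search(entry) for pattern in patterns)


# Hypothetical expressions for a concept such as "stakeholder demands".
stakeholder_patterns = compile_patterns(
    [r"\bstakeholders?\b", r"\bdemand(?:s|ed)?\b"]
)

print(regex_concept_present("The stakeholders demanded more housing.", stakeholder_patterns))
print(regex_concept_present("I converted the lot to open space.", stakeholder_patterns))
```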


The semantic similarity approach minimizes the teachers’ input, 
which encouraged us to adopt it for assessing whether participant 
responses contain the targeted, required 
concepts. This method is also relatively easy to automate, 
meaning that after the teacher has made a small set of 
specifications, classifiers can be developed without further human 
input. The RegEx approach is less flexible than the 
semantic similarity approach, as novel expressions of a core 
concept, not yet encoded in the regular expressions, are less likely 
to be correctly identified. However, the RegEx approach is capable of 
identifying core concepts that are characterized by a closed set of 
keywords, for which semantic similarity may not perform as 
needed. 


¹ https://www.microsoft.com/en-us/download/details.aspx?id=52365 
4. EXPERIMENTS AND RESULTS 


First, we describe the data set we used in our experiments and 
then present the results obtained with our automatically generated 
classifiers. We also apply these classifiers to participant generated 
notebook entries to assess the performance of our models on 
unseen data. 


4.1 Data Set 


As we mentioned above, our classifiers were generated from 
specifications made by a teacher as she customized an activity in 
Land Science. To evaluate our method and test how our classifiers 
would perform on unseen data, we selected 199 participant entries 
from prior, uncustomized, implementations of Land Science. We 
took these entries from uncustomized versions of the activity the 
teacher in this study worked to customize. In this case, the 
customizations to this activity’s notebook requirements and 
assessment criteria, as defined by the core concepts, were not 
drastically different from the requirements and criteria of the 
original activity. Thus, this situation provided a case where we 
could test our classifiers on data that was expected to contain 
some distribution of the core concepts. In general, however, our 
method for generating classifiers is meant to accommodate both 
small customizations, such as we have here, and more drastic 
ones, such as a case where a teacher creates an entirely new 
activity. Therefore, we cannot always expect to have such similar 
data for testing. 


The 199 participant entries were manually coded for each core 
concept by two raters. Both raters had worked with the teacher in 
this study to define the core concepts and had extensive prior 
experience coding notebook entries from Land Science. Using the 
process of social moderation [6], the raters agreed on the presence 
or absence of each core concept for each of the 199 entries. From 
Table 1, we see that the distributions of some concepts are 
balanced (C2), while others are skewed (C5). However, because 
we built classifiers based on the textual features of teacher 
samples, skewness should have a small effect on the performance 
of the model. 


Table 1. Distribution of concepts in data set 


Concept                          Notation  #Concepts  %Concepts
land use changes                 C1        141        72.860
original land use configuration  C2        114        57.280
location of land use change      C3        79         39.690
indicator changes                C4        128        64.320
stakeholder demands              C5        46         23.110


4.2 Threshold Initialization Method 


To derive a similarity score threshold, which is needed for the 
semantic similarity based classifiers, we calculated the similarity 
scores between the tagged chunks of text for each core concept in 
the teacher provided examples. Next, we calculated the average 
and standard deviation of these scores and set our threshold as the 
average similarity minus one standard deviation for each core 
concept. The values we obtained using this approach are reported 
in Table 2, where the last column is the derived threshold for each 


classifier. Table 2 shows thresholds for both LSA based similarity 
and the NN based model. 
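
The threshold derivation described above (average pairwise similarity minus one standard deviation) can be sketched as follows. The function names are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not say which variant was used.

```python
from itertools import combinations
from statistics import mean, pstdev
from typing import Callable, List, Sequence


def pairwise_scores(
    chunks: Sequence[str], similarity: Callable[[str, str], float]
) -> List[float]:
    """Similarity of every unordered pair of teacher-tagged chunks."""
    return [similarity(a, b) for a, b in combinations(chunks, 2)]


def threshold_from_scores(scores: Sequence[float]) -> float:
    """Derived threshold: average similarity minus one standard deviation."""
    if not scores:
        raise ValueError("need at least one similarity score")
    return mean(scores) - pstdev(scores)


# Arithmetic check against the C1 LSA row of Table 2 (Avg. - Std.).
print(round(0.584516 - 0.228474, 6))

# Mechanics on made-up scores.
print(round(threshold_from_scores([0.2, 0.4, 0.6]), 6))
```

A wide spread among the teacher's examples lowers the threshold, which is how very small derived thresholds can arise for concepts with dissimilar examples.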


Phrase similarity based on LSA relies on the combination of 
the constituent words of a phrase. Hence, the similarity score will be 
biased towards phrases having common words. In contrast, the 
NN based semantic similarity method [7, 17] projects the phrase 
pairs into a common low dimensional space; hence, the similarity 
score obtained will be more consistent irrespective of the presence 
of common words in the phrases. 


Table 2. Derived threshold for LSA based and NN based 
similarity method 


Classifier  Method  Avg.      Std.      Avg. - Std.
C1          LSA     0.584516  0.228474  0.356042
            NN      0.437065  0.122893  0.314172
C2          LSA     0.239488  0.189726  0.049762
            NN      0.242053  0.168682  0.073372
C3          LSA     0.696795  0.103681  0.593114
            NN      0.523347  0.077424  0.445923
C4          LSA     0.278877  0.170271  0.108607
            NN      0.174579  0.124677  0.049902
C5          LSA     0.466482  0.196369  0.270113
            NN      0.149499  0.096005  0.053494


Note: Avg.=average similarity score, Std.=standard deviation. 


Table 2 also shows that the standard deviations of the 
similarity scores for the NN based models are lower than those of the 
LSA based semantic similarity models for all five classifiers. 
This supports our previous understanding that the LSA based 
similarity measure is biased towards phrases with a high 
degree of word overlap and gives lower scores for phrases with 
a lower degree of word overlap, resulting in high variation in the 
scores. On the other hand, the NN based method does not suffer from 
such bias. 


4.3 Results 


We now present precision and recall results for LSA based and 
NN based models for the derived thresholds presented earlier and 
for ideal thresholds (described next). Afterward, we present 
results for the RegEx based classifiers. 


As an alternative to deriving classifiers based on teacher-specified 
input, we wanted to see how well our methods performed when 
trained on actual participant data, that is, when the threshold 
used in the classifiers to make the final decision was fit on 
actual participant data. We call such a participant-data-trained 
threshold the ideal threshold. This ideal threshold can only be 
computed when participant data is available, which is a major 
constraint when developing a new internship, as we pointed out 
earlier. 


Figures 1 and 2 show the precision and recall plots for increasing 
thresholds of the LSA based and NN based similarity methods. These 
plots were obtained by comparing the model classifications to the 
manual classifications on the 199 participant entries. In general, 
whenever precision increases at a particular 
threshold, recall decreases, and vice versa. The point of 
intersection of the precision and recall curves for a particular classifier 
gives the ideal precision and recall; that is, the classifier has 
balanced performance in terms of precision and recall. From the 
figures, it is clear that if we want fewer false negatives, for 
example, the value of the threshold should be decreased. In such a 
case, the precision will be compromised. Therefore, the threshold 
should be chosen carefully so as not to compromise either precision or 
recall to an undesirable extent. 
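
The threshold sweep and the choice of the point where precision and recall meet can be sketched as follows. This is an illustration under assumptions: the step count, the strict "greater than" decision rule, the handling of undefined precision (skipped), and the function names are choices made here, and the scores and labels are made up.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, Optional[float], float]  # (threshold, precision, recall)


def sweep_thresholds(
    scores: Sequence[float], labels: Sequence[bool], steps: int = 50
) -> List[Point]:
    """Precision and recall at evenly spaced thresholds from 0 to 1."""
    positives = sum(1 for y in labels if y)
    curve: List[Point] = []
    for i in range(steps + 1):
        threshold = i / steps
        predicted = [s > threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y)
        n_predicted = sum(1 for p in predicted if p)
        precision = tp / n_predicted if n_predicted else None  # undefined
        recall = tp / positives if positives else 0.0
        curve.append((threshold, precision, recall))
    return curve


def balanced_point(curve: Sequence[Point]) -> Point:
    """First point where precision and recall are closest to each other."""
    defined = [pt for pt in curve if pt[1] is not None]
    if not defined:
        raise ValueError("precision is undefined at every threshold")
    return min(defined, key=lambda pt: abs(pt[1] - pt[2]))


# Made-up similarity scores and human codes for six entries.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [True, True, False, True, False, False]
print(balanced_point(sweep_thresholds(scores, labels)))
```

Lowering the threshold below the balanced point raises recall at the cost of precision, and raising it does the opposite.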


The results obtained with ideal and derived thresholds are 
summarized in Table 3. These data suggest that, for the ideal 
thresholds, the LSA based classifiers for core concepts C1 
through C4 performed well, with the lowest precision and recall 
value being 0.72. However, the NN based classifiers 
outperformed the LSA classifiers for all core concepts other than 
C2. LSA based models depend on the overlapping content words 
in phrases, and their performance suffers in cases where the phrases 
contain out of vocabulary words. Out of vocabulary here means 
that the pre-built vocabulary on which LSA similarity relies, derived 
from a large corpus, does not contain some of the words, such as proper 
nouns that are specific to Land Science. In contrast, NN based 
similarity models rely on letter trigrams from a very large corpus, 
and every input phrase is converted to letter trigrams. Therefore, 
the NN based models are capable of capturing the semantics even 
when there are out of vocabulary words in the phrases or context 
of the phrases. Hence, the NN based classifiers are superior for 
these concepts. However, for C2, the NN based classifier lagged 
in performance by 2% in precision and recall compared to the 
LSA based classifier because the teacher samples used for C2 
contained only short phrases with very few context words, and 
some of the overlapping words in the phrases boosted the LSA based 
classifier. The C5 classifiers performed poorly for both the LSA and 
NN based methods. 
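
The letter-trigram idea mentioned above can be illustrated with a short sketch. The exact preprocessing inside the Sent2Vec tool is not described in the text; the boundary marker `#` and the function names below are assumptions that follow the general word-hashing idea, not a description of the tool.

```python
from typing import List


def letter_trigrams(word: str, boundary: str = "#") -> List[str]:
    """Split one word into overlapping letter trigrams with boundary marks."""
    marked = f"{boundary}{word.lower()}{boundary}"
    return [marked[i : i + 3] for i in range(len(marked) - 2)]


def phrase_trigrams(phrase: str) -> List[str]:
    """Letter trigrams of every whitespace-separated word in a phrase."""
    return [tri for word in phrase.split() for tri in letter_trigrams(word)]


print(letter_trigrams("good"))
print(phrase_trigrams("zoning plan"))
```

Because any word, including a proper noun absent from a fixed vocabulary, still decomposes into trigrams, a model defined over trigrams can assign it a representation.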


Table 3. Precision and recall for ideal and derived thresholds 
for LSA based and NN based similarity method 


Classifier  Method  Threshold     Precision     Recall
                    I      D      I      D      I      D
C1          LSA     0.36   0.35   0.84   0.82   0.84   0.86
            NN      0.34   0.31   0.86   0.84   0.86   0.92
C2          LSA     0.80   0.05   0.80   0.57   0.80   1.00
            NN      0.52   0.07   0.78   0.57   0.78   1.00
C3          LSA     0.38   0.59   0.82   0.92   0.82   0.80
            NN      0.36   0.44   0.86   0.96   0.86   0.78
C4          LSA     0.56   0.11   0.72   0.64   0.72   1.00
            NN      0.46   0.05   0.74   0.64   0.74   1.00
C5          LSA     1.00   0.27   0      0.22   0      0.98
            NN      0.80   0.05   0      0.23   0      1.00
Note: I=ideal, D=derived. 


For the LSA based classifiers, the highest precision using derived 
thresholds was 0.92 with recall of 0.80 for C3, and the lowest 
precision was 0.22 with recall of 0.98 for C5. As with the 
ideal thresholds, NN based classifiers generally outperformed 
their LSA based counterparts, with the exception of the 
recall for concept C3. 


The results in Table 3 suggest that a good threshold can be 
derived without participants’ data. The high recall and precision 
using derived thresholds for concepts C1 and C3 suggest the 
possibility of assessing the core concepts in participant notebook 
entries with classifiers generated using only the teacher's sample 
responses. However, when compared to the results using the ideal 
thresholds, classifiers C2, C4 and C5 did not perform well; their 
derived thresholds differed largely from their ideal thresholds, and 
their precision and recall suffered. The relatively low derived 
threshold values for these concepts suggest that their associated 
examples, which were used to calculate the thresholds, were 
semantically dissimilar. Dissimilar examples for a given concept 
could imply an ill-defined concept and that the provided examples 
do not represent it well. Alternatively, dissimilar examples could 
imply a complex or varied concept that requires highly different 
examples to represent it fully. Because we cannot distinguish 
between these cases automatically, we plan in future work to set a 
best guess threshold of 0.5 in such cases. 


Table 4. Performance of regular expression model 


Concept  Precision  Recall
C1       0.963      0.551
C2       0.640      1.000
C3       1.000      0.746
C4       0.791      0.890
C5       0.894      0.739


Table 4 shows the precision and recall of the RegEx based classifiers. 
Here the performance for concepts C2, C4, and C5 is more 
interesting when we compare those values with the previously 
discussed results. For example, the precision and recall for C5 
improved markedly, to 0.89 and 0.73 respectively, 
whereas in the previous case those values were either undefined or 0 
precision with recall 1. Furthermore, the precisions of C1 and C3 
are high; however, the recalls are relatively low. Qualitatively 
investigating these results suggested that participants’ entries 
expressed these concepts in a variety of ways that were not 
captured by the regular expression lists. 


Given that we see improvements for some core concepts using the 
regular expression based approach, these results suggest that the 
teacher provided samples on which the similarity measures were 
based may not have included a variety of key terms that could 
indicate the presence or absence of these core concepts. 
Comparing the sample responses and the keywords provided 
revealed that the samples indeed did not contain many of the 
keywords in the list. In some cases, the keywords were synonyms 
or other instances of particular kinds of words provided in the 
sample responses. For example, in Land Science, there are sixteen 
stakeholders who give demands on zoning plans. The core 
concept C5, stakeholder demands, is meant to capture references 
to these 16 stakeholders in participant notebook entries. 
Examining the teacher provided samples, we found that only four 
stakeholders were covered, while the keyword list for the core 
concept mentioned all sixteen. We plan in future experiments to 
either ask teachers to provide enough samples to cover finite sets 
of semantic content such as this or to incorporate the provided 
keyword list into the semantic similarity methods as extra 
samples. 
[Plot: "Precision and Recall by Threshold"; x-axis: Threshold (0 to 1.2); y-axis: 0 to 1; one series per concept, C1 to C5.]
Figure 1. Precision and recall for LSA based similarity thresholds (solid lines are precision; dotted lines are recall) 

Figure 2. Precision and recall for neural network based similarity thresholds (solid lines are precision; dotted lines are recall) 


5. CONCLUSIONS 


In this paper, we investigated a method for creating classifiers 
for virtual internship notebook entries using teacher provided 
specifications without the use of participant data. Our classifiers 
used LSA based and NN based semantic similarity methods to 
capture the general semantic relationships among concepts. We 
also investigated regular expression based classifiers. The results 
are impressive in the sense that some classifiers, using both LSA 
and NN, gave high precision and recall values using thresholds 
derived without participant data, which suggests that our general 
method is plausible. 


Furthermore, the superiority of the NN classifiers over the LSA 
classifiers suggests that NN methods are preferable when the 
participant responses vary widely in terms of style, content, and 
word overlaps with the teacher provided sample response. 


The improved performance for some core concepts, such as C5, 
using regular expression based classifiers implies that such 
classifiers performed better for concepts whose sample 
responses did not contain a variety of keywords, despite the 


benefits we saw for NN models. These results suggest that, in 
some cases, teachers may need to provide more exhaustive 
samples, and that provided keywords and regular expression 
based classifiers may supplement a semantic similarity 
approach. 


In future work, we will investigate a method to combine the 
classifiers in order to better understand how the performance of 
one model is boosted by another in the scenario where 
participants’ responses vary widely compared to the sample 
responses. We will also examine how performance is affected by 
setting the thresholds to 0.5 for concepts C2, C4 and C5. 


Our work has several limitations; most obviously, we used 
participant data to evaluate the performance of some of our 
classifiers. In the real use case of our method, we cannot expect 
to have such data available. We want to make clear, however, 
that our purpose in using participant data was not to train better 
classifiers, but to evaluate our method for generating them. 
Thus, our results suggest that this method can produce 
classifiers that would perform well on unseen data, but more 
refinements are needed. 
ABSTRACT 


We propose a model that employs convolutional neural 
networks (CNN) to evaluate sociomoral reasoning maturity, a 
key social ability necessary for adaptive social functioning. 
Our model is used in a serious game to evaluate learners. It uses 
pre-annotated textual data (verbatims) and a coding scheme 
(SoMoral) applied by experts in psychology. State-of-the-art text 
classification algorithms (Support Vector Machine, Naive 
Bayes, etc.) achieved low results in our context, in contrast to 
the CNN, which achieved the best results with little fine-tuning 
of the input data representation. We use a simple but efficient 
input vector representation learnt directly from the dataset 
without losing the semantics of the sentences. We present a 
series of experiments with 5 baseline text classification 
algorithms and 4 baseline data representations. The results show 
that our model can predict the level of sociomoral reasoning with 
about 92% accuracy. Our findings advance not only the 
text-mining field but also user modeling in highly social 
adaptive systems. 


Keywords: Convolutional neural networks, data vectors 
representation, text classification, moral reasoning, social skills, 
serious game, learner model. 


1. INTRODUCTION 


Sociomoral reasoning (SMR) is a socio-cognitive construct 
essential for appropriate decision-making in social contexts, as 
well as for social adaptation. It is commonly defined as how 
individuals think about moral emotions and conventions that 
govern social interactions in their everyday lives [2]. The ability 
to predict and identify an individual’s sociomoral reasoning 
maturity level is a key step to quantifying people’s social 
functioning and can be used to identify those at risk for 
maladaptive social behaviour and orient them towards 
appropriate services. We propose a model and a simple input 
data representation for predicting the level of SMR maturity of 
an individual based on the justifications they provide when 
solving sociomoral dilemmas. A computerized test, the 
Socio-Moral Reasoning Aptitude Level (So-Moral), was 
designed in which children and adolescents are presented with 
visual social dilemmas from everyday life and asked to 
determine how they would react and provide a justification for 
their answer [21]. A serious game was designed based on the 
original tasks, and our model was designed to evaluate subjects 
using existing verbatims and scoring by experts who use the 
moral maturity coding scheme inspired by a 
cognitive-developmental approach [7]. The proposed model can 
be seen as a supervised text classification task. 


Text classification is the task of automatically assigning classes 
to sentences or documents. There exist several supervised 
classification algorithms that have achieved good results in text 
classification tasks (sentiment analysis [15, 19], topic mining 
[5], etc.) such as Support Vector Machine (SVM), Latent 
Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA) or 
Multilayer Perceptron (MLP) [1]. While their primary use has 
been in image classification and speech recognition, deep 
learning techniques (such as Convolutional Neural Networks) 
have recently been used for text classification and have achieved 
remarkable results [8, 11, 23]. A text document is characterized 
by the words it contains, and consequently the representation of 
textual data is based only on its words [10]. Thus, an important 
feature in text classification is the word vector representation of 
input data. The bag-of-words (BoW) vector representation is the 
simplest and most widely used representation, where vectors 
indicate which words appear in the documents without 
preserving word order. Vectors from BoW lack semantics and 
are usually huge and sparse. Alternative solutions have been 
proposed such as n-gram models [18] (bi-gram, tri-gram, etc.), 
word2vec or wordnet. However, to be effective, models that use 
n-gram or word2vec require a huge dataset and sentences or 
words that are frequently observed. Similarly, the use of 
wordnet is language dependent. 


To benefit from word order and the annotated dataset, we built 
our classifier using CNN and a simple but effective data 
representation approach called class-based representation 
(CBR). CNNs are neural networks with layers representing 
convolving filters applied to local features [12]. The application 
of CNN on text classification makes use of the 1D structure 
(word order) of text data so that each unit in the convolution 
layer responds to a small region of a document (a sequence or 
pattern of words) [8]. CNN can extract deep features from data, 
which can improve the discrimination between classes. 


1.1 The Les Dilemmes serious game 


One of the objectives underlying the development of the 
proposed CNN is to implement the automated scoring 
mechanism in a serious video game called Les Dilemmes. It is a 
first-person serious game which aims to assess and train the 
social reasoning skills of the player. It is a virtual environment 
offering an interactive context which is emotionally, socially 
and cognitively rich. Players face different socio-moral 
dilemmas in a 3D environment in which they have to make 
decisions and are asked to provide oral justifications for the 
choices they make. They can also ask the opinions of virtual 
friends (non-player characters) in the game. Their answers are 
selected from previously recorded verbatims from the different 
moral maturity levels according to the coding scheme (SoMoral 
[2]). The learner (player) model implemented in the learning 
environment includes 3 key dimensions: the affective state, the 
cognitive profile and the sociomoral reasoning profile. 
Therefore, sociomoral reasoning skill is part of the player model 
implemented in the game. As stated in [3], a learner model that 
can accurately represent the learner longitudinally in a game 
leads to efficient adaptation, which in turn helps increase player 
satisfaction and motivation. To this end, it is important to 
ensure the effectiveness of the learner model before deploying 
the system for real use. 


Through this work, we aim to build an effective model of the 
sociomoral facet of the player. The level of sociomoral 
reasoning of an individual is determined from the verbal 
justifications they provide when solving the dilemmas. This 
involves the implementation of a model for automatic 
measurement of this level during the game. We have a dataset of 
verbatims coming from the SoMoral experimentation already 
annotated by experts and a description (a paragraph with key 
concepts) associated with each different level (or class) of 
maturity. This paper aims to propose a machine learning model 
that can accurately assess the sociomoral reasoning skill level of 
a player based on their verbatim. To our knowledge, there is no 
research that deals with the automatic classification of 
sociomoral reasoning skills as part of learner-player social 
behaviour in serious games. 


1.2 Sociomoral reasoning skill levels 

The original So-Moral task includes five different levels of 
sociomoral reasoning [2]: (1) Authoritarian-based consequences, 
(2) Egocentric exchanges, (3) Interpersonal Focus, (4) Societal 
Regulation and (5) Societal Evaluation. Transition levels (i.e. 
1.5, 2.5, 3.5, 4.5) are used to account for verbatims that provide 
elements of two reasoning stages and show a sequential 
progression from one stage to another. Occasionally, a verbatim 
is assigned to two different but close levels (1 being the 
maximum deviation) when two independent experts annotate the 
data for rater reliability purposes. 


1.3 Dataset 


The dataset consists of a benchmark of 691 verbatims (in 
French) manually coded by experts. Verbatims are short or long 
text fragments containing at least one sentence. They are not 
equally distributed between levels. Table 1 shows the 
distribution of the data; for example, levels 4.5 and 5 have a 
smaller number of verbatims than other levels. Level 5 
constitutes the highest level of maturity and it is therefore more 
rarely attributed to children’s and adolescents’ socio-moral 
justifications. This implies that certain levels have very few 
examples to learn from. Of the 691 verbatims, 53 were classified 
as 0, which means that the verbatim does not represent one of 
the sociomoral reasoning levels (e.g., the answer provided by the 
participant was tangential and did not contain a justification of 
their social response). We do not consider these cases in our 
study, which reduces our corpus to 638 verbatims. 


Table 1. Distribution of verbatims between levels 


2. BASELINE METHODS FOR SENTENCE CLASSIFICATION 


Since verbatims are annotated text data, we investigated the use 
of some existing sentence classification algorithms. In this 
section, we present state-of-the-art methods for text classification 
that have shown good results on similar problems. 


2.1 Input representation 
Here, we present some representation techniques that we 
experimented with for determining sociomoral reasoning level. 


Bag-of-words (BoW): BoW is a binary word presence 
representation (indicating whether a word is present or not in a 
sentence). Each distinct word in the dataset corresponds to a 
feature in the representation. Each labeled verbatim in the 
dataset is transformed to a vector of N columns, where N (the 
vocabulary size) is the total number of distinct words in the 
entire corpus. 
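
The binary BoW encoding described above can be sketched as follows. This is a minimal illustration, not the authors' code: the whitespace tokenizer, the function names `build_vocabulary` and `bow_vector`, and the two toy sentences are assumptions made for the example.

```python
def build_vocabulary(documents):
    """Map each distinct token in the corpus to a column index."""
    vocab = {}
    for doc in documents:
        for token in doc.lower().split():  # assumed tokenizer: whitespace split
            vocab.setdefault(token, len(vocab))
    return vocab


def bow_vector(document, vocab):
    """Return a binary presence vector of length N = len(vocab)."""
    vector = [0] * len(vocab)
    for token in document.lower().split():
        if token in vocab:
            vector[vocab[token]] = 1
    return vector


# Toy corpus (hypothetical, not taken from the dataset).
corpus = ["c'est tricher", "c'est mal"]
vocab = build_vocabulary(corpus)
vectors = [bow_vector(doc, vocab) for doc in corpus]
```

With a real corpus N grows with every new distinct word, which is why such vectors tend to be large and sparse.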


Matrix Tf-idf: Tf-idf (Term Frequency-Inverse Document 
Frequency) representation allows evaluation of the importance 
of a term contained in a document relative to a collection. The 
tf-idf value increases proportionally to the number of times a 
word appears in the document, but is offset by the frequency of 
the word in the corpus, which helps adjust for the fact that some 
words appear more frequently in general. 
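
As a rough sketch of this weighting, the snippet below computes tf-idf with raw term counts and an unsmoothed logarithmic inverse document frequency. The paper does not state which tf-idf variant was used, so this particular formula, the function name `tf_idf`, and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = count(term, doc) * log(N / df(term)).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return weights


# Toy documents (hypothetical).
docs = ["mal mal tricher", "mal voler"]
weights = tf_idf(docs)
```

In this toy case "mal" occurs in every document, so its weight is zero even though it is the most frequent term in the first document, which is the offsetting effect described above.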


Dictionary of synonyms: We developed a tool to compare our 
representation model with an approach similar to that of wordnet 
and to make use of the concept lists provided by experts for each 
level. The tool takes two words and, for each word, extracts a 
set of synonyms from a free-access online French synonym 
database; it then computes the intersection of those sets to 
determine whether the two words are related or not. The 
So-Moral scoring manual provides a description of what types of 
justifications should be included at each level and a list of 
concepts that describe each level. We extracted keywords from 
this information (after removing stop words) and obtained a list 
of 53 words representing all the levels, which are used as a 
vocabulary set for the data. Each word is represented by a vector 
of size 1*53. For each word from a verbatim and for each word 
from the vocabulary list, if the intersection of their synonym 
sets is not empty, the corresponding entry is coded 1; 
otherwise, it is coded 0. 
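
The synonym-intersection test can be illustrated with a small sketch. The `SYNONYMS` table below is a hypothetical stand-in for the online French synonym database, the two-word `vocabulary` stands in for the 53-keyword list, and counting a word as a member of its own synonym set is an assumption of this example.

```python
# Hypothetical synonym table; the paper queried an online French database.
SYNONYMS = {
    "tricher": {"frauder", "tromper"},
    "tromper": {"duper", "frauder"},
    "ami": {"copain", "camarade"},
}


def related(word_a, word_b, synonyms=SYNONYMS):
    """Two words are related if their synonym sets intersect.

    Assumption: each word is also counted as a member of its own set.
    """
    set_a = synonyms.get(word_a, set()) | {word_a}
    set_b = synonyms.get(word_b, set()) | {word_b}
    return bool(set_a & set_b)


def encode_word(word, vocabulary):
    """Code 1 for each vocabulary keyword related to `word`, else 0."""
    return [1 if related(word, keyword) else 0 for keyword in vocabulary]


# Stand-in for the 53-keyword vocabulary extracted from the scoring manual.
vocabulary = ["tromper", "camarade"]
code_tricher = encode_word("tricher", vocabulary)
code_ami = encode_word("ami", vocabulary)
```

With the full vocabulary, each word of a verbatim would yield a vector of size 1*53 in the same way.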


Word2vec: It is common in sentence classification to use 
publicly available word2vec vectors that are trained on over 100 
billion words from Google news [11]. This technique usually 
works with sentences in English. Instead of directly using those 
pretrained vector representations, others try to learn those 
vectors directly from their dataset. We also attempted to 
represent our data with word2vec vectors that were trained on 
our corpus. 


2.2 Supervised classification algorithms 

There exist several supervised classification algorithms. Among 
them, we selected ones that generally produce excellent results 
in text classification. 


SVM (Support Vector Machine): The learning algorithm 
consists of finding a hyperplane that separates the levels 
appropriately while limiting the classification error rate on 
new data. The aim is to maximize the margin, that is, the 
distance between the hyperplane and the closest vectors of each 
level, which helps avoid overfitting. Although this algorithm is 
more suited to binary class problems, the aim was to explore its 
behavior on our dataset since it generally provides good 
outcomes on text classification [1, 6]. 


NB (Naive Bayes): NB is a probabilistic classifier based on 
Bayes’ theorem with a naive assumption of attribute 
independence [17]. It is generally used in spam detection, 
sentiment analysis and in the medical field. The principle is to 
compute the posterior probability of the class for a given 
document, and the class with the highest posterior probability is 
then assigned to the document. We chose to experiment with NB 
because it is fast [16] and easy to implement especially in real- 
time applications. 
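
The decision rule (assign the class with the highest posterior) can be sketched as below. This is an illustrative multinomial Naive Bayes with add-one smoothing working in log space; the smoothing choice, the function names, and the three toy training sentences are assumptions, not details taken from the paper.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect class counts, per-class word counts and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(doc, model):
    """Return the class with the highest (log) posterior probability."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in doc.split():
            if token in vocab:  # unseen words are ignored
                # Add-one (Laplace) smoothed log likelihood (assumed).
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (hypothetical words and levels).
model = train_nb(
    ["punition peur", "echange faveur", "punition parents"],
    [1, 2, 1],
)
```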


LSA (Latent Semantic Analysis): LSA is an algorithm that has 
been developed specifically for mining textual data. This 
algorithm allows us to take into account semantics, which very 
few algorithms offer. It is an interesting technique because it 
does not consider any information related to language 
processing (meaning of words, dictionaries etc.). This makes it 
possible to establish relations between a set of documents and 
the terms it contains by constructing "concepts" related to 
documents and terms [13, 20]. 


LDA: LDA is a machine learning technique that has 
revolutionized the extraction of latent topics in texts [4]. It 
groups documents that are similar to each other into topics. 
Each document is represented as a mixture of topics. 
We trained the LDA model on our 638 verbatims by setting the 
number of topics to 5 or 9 (depending on the problem). The 
classification of a new verbatim was achieved by computing the 
cosine similarity between the verbatim and each of the topic 
probability distribution vectors over words. 
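
The final classification step (cosine similarity between a verbatim and each topic's word distribution) could look like the sketch below. The three-word vocabulary, the two topic distributions and the function names are invented for illustration, and how topics are matched to maturity levels is not specified in the text, so the sketch only returns a topic index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_topic(doc_vector, topic_word_dists):
    """Index of the topic whose word distribution is most similar."""
    scores = [cosine(doc_vector, topic) for topic in topic_word_dists]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical topic-word distributions over a 3-word vocabulary.
topics = [
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
]
doc = [2, 1, 0]  # word counts of a new verbatim over the same vocabulary
best = closest_topic(doc, topics)
```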


MLP (Multilayer Perceptron): MLP is a feedforward artificial 
neural network model with one or more hidden layers between 
the input and output layers that maps sets of input data onto a 
set of appropriate outputs. MLPs are widely used for pattern 
classification, 
recognition, prediction and approximation. MLPs are able to 
learn non-linear models, but require tuning of a number of 
hyperparameters such as the number of hidden neurons, layers, 
and iterations. 


3. THE PROPOSED MODEL 
3.1 Class-based Representation (CBR) 


The poor distribution of verbatims over levels (see Table 1) and 
the fact that some of the keywords that can aid in the 
discrimination of levels appear just once or twice in the entire 
corpus make some data representation techniques (e.g. BoW, 
tf-idf) inaccurate. Also, verbatims are sentences that are 
semantically very rich and of varying sizes; state-of-the-art 
techniques often fail to accurately classify this type of data 
(see Section 5 for details). 


We propose a simple yet fast and efficient representation model 
for data that uses only an annotated dataset. Using this technique, 
we gain over 10% accuracy compared to all other classification 
techniques previously presented, and over 30% accuracy 
compared to some state-of-the-art representation models. A 
further advantage of the proposed representation is that it is not 
language dependent. It does not consider any information related 
to language processing (meaning of words, dictionaries etc.), 
which can be time consuming. 


We represented each verbatim as a feature vector, whose values 
(1 or 0) accounted for the presence of a word in a level. The idea 
of CBR is simple: if a word appears in the verbatim of one level, 
then it must be semantically correlated with that class. In turn, if 
a word appears in verbatims from different classes, then it must 
be semantically correlated with all of those classes, but has less 
significance than a word that appears only in verbatims from one 
of those levels. For example, suppose we have 4 levels and 2 
verbatims, from level 1 and level 2 (see Table 2a); Table 2b 
shows the representation of two words in this specific case (1 
means semantically correlated and 0 means uncorrelated). 


Table 2. a) Examples of two verbatims 


Parce que c'est mal et elle n'apprendrait pas de ses erreurs 
(Because it's wrong and she won't learn from her mistakes) 


C'est tricher (It’s cheating) 


b) Examples of CBR on the 2 verbatims from a). 


The input data for the CNN model is a matrix with 5 columns 
(one per level) and 88 rows, where 88 is the length (number of 
words) of the longest sentence of the corpus after data 
pre-processing. 
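
A minimal sketch of the class-based representation is given below, using the 4-level example from the text. Several details are assumptions made for illustration: the whitespace tokenizer, the assignment of the first verbatim to level 1 and the second to level 2, zero vectors for unseen words, and zero-padding of short verbatims up to the maximum length.

```python
def word_class_table(verbatims, labels, levels):
    """For each word, mark (with 1) every level in which it was observed."""
    table = {}
    for text, label in zip(verbatims, labels):
        column = levels.index(label)
        for word in text.lower().split():  # assumed tokenizer
            table.setdefault(word, [0] * len(levels))[column] = 1
    return table


def cbr_matrix(text, table, n_levels, max_len):
    """Stack the word vectors in sentence order; pad with zero rows."""
    rows = [list(table.get(word, [0] * n_levels))
            for word in text.lower().split()][:max_len]
    rows += [[0] * n_levels for _ in range(max_len - len(rows))]
    return rows


levels = [1, 2, 3, 4]
verbatims = [
    "Parce que c'est mal et elle n'apprendrait pas de ses erreurs",
    "C'est tricher",
]
labels = [1, 2]  # assumed level of each verbatim, for illustration only

table = word_class_table(verbatims, labels, levels)
matrix = cbr_matrix("C'est tricher", table, n_levels=4, max_len=3)
```

Here "c'est" occurs at both levels and is coded 1 in two columns, while "tricher" is coded 1 only in the column of the level where it was seen. With 5 levels and a maximum length of 88 words, the same procedure yields the 88-row, 5-column input described above.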


3.2 The CNN Model 


According to LeCun and colleagues [14], deep learning allows 
computational models composed of multiple layers of 
processing to learn data representations at multiple levels of 
abstraction. Deep learning techniques such as CNN have been 
shown to be effective for Natural Language Processing (NLP) 
and have achieved excellent results on sentence classification 
[11, 24], sentence modeling, and semantic parsing [9]. They can 
explore small text regions to learn useful features for 
categorization [8]. The CNN we are proposing requires as input 
a vector representation (88*5 or 88*9) of verbatims that 
preserves the internal order of words, as in class-based 
representation. 


Parameter selection: The hyper-parameters of our CNN, such 
as the size of filters and the number of layers, were chosen based 
on the results obtained empirically from several tests on our 
dataset. The structure of the CNN consists of two layers of 
convolution, two layers of maxpooling and one layer fully 
connected to the output. The fully connected layer of our model 
uses 40 rectified linear units. The structure also includes two 
filter windows, one of size 1x5 for the 5-level classifier (1x9 for 
the 9-level classifier) and the other of 2x1 in size. The first filter 
window is used to implement the convolution on the input data. 
Using a 1-dimension window here allows exploration of the data 
one word at a time in order to derive specific features associated 
with each word (which contributes to determining the semantics 
of the word). Following this step and the maxpooling of its 
output, another filter is used for a second convolution. This 
second convolution aims at extracting features related to word 
order (or text regions). A filter vector of 2x1 (for exploring the 
text regions) is used for this purpose. There are 20 filters in each 
convolutional layer. The batch size was set to 500 and the 
number of iterations to 250. 
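
To make the layer sizes concrete, the sketch below traces the tensor shapes through the structure just described. It is shape arithmetic only, not the authors' implementation. The pooling size (2 along the word axis), unpadded convolutions with stride 1, and an output layer with one unit per level are assumptions, since the text does not state them.

```python
def conv_out(size, kernel, stride=1):
    """Output length of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


def trace_shapes(n_words=88, n_levels=5, n_filters=20, pool=2):
    """Return (layer name, shape) pairs for the described structure."""
    shapes = [("input", (n_words, n_levels, 1))]
    # First convolution: a 1 x n_levels window reads one word at a time.
    height = conv_out(n_words, 1)
    width = conv_out(n_levels, n_levels)
    shapes.append(("conv1", (height, width, n_filters)))
    height //= pool  # ASSUMED pooling of size 2 along the word axis
    shapes.append(("pool1", (height, width, n_filters)))
    # Second convolution: a 2 x 1 window spans two consecutive positions.
    height = conv_out(height, 2)
    shapes.append(("conv2", (height, width, n_filters)))
    height //= pool  # ASSUMED pooling size again
    shapes.append(("pool2", (height, width, n_filters)))
    shapes.append(("dense_relu", (40,)))   # 40 rectified linear units
    shapes.append(("output", (n_levels,)))  # ASSUMED: one unit per level
    return shapes


for name, shape in trace_shapes():
    print(name, shape)
```

Calling `trace_shapes(n_levels=9)` gives the corresponding shapes for the 9-level classifier, where the first window is 1x9.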


4. EXPERIMENTS 


Our experiments involve the five classification algorithms, 
Naive Bayes (NB), LSA (Latent Semantic Analysis), LDA 
(Latent Dirichlet Allocation), MLP (Multilayer Perceptron) 
and Support Vector Machine (SVM), that we presented earlier 
in this paper. The goal of using all these algorithms is to 
compare the models obtained from them with that obtained from 
the CNN-based model. We explored existing input 
representations of data and compared results with the CBR. 


4.1 Data pre-processing 
For consistency between different input representations and 
algorithms, we used the same pre-processing steps for the data. 


Stop-word removal: Generally, the very first step to reduce the 
vector size of the data is to remove stop-words (connective 
words, such as “a”, “in”, “the” in English). On their own, they 
are considered to carry too little meaning to inform the classifier 
[1]. Unfortunately, the typical list of stop words for French 
available online gave poor results in our classification task. 
Instead, we excluded common words in the verbatims that 
were not discriminatory for the different SMR levels. 


Lemmatization: This is the process of mapping words onto their 
base form [10]. For example, the words “installed”, “installs” 
and “installing” are mapped to “install”. With this mapping, 
binary word-presence representations treat words of different 
forms as a single feature, hence reducing the total number of 
features. We used the Stanford NLP tools to 
apply a French lemmatization to the verbatims. 


4.2 Results 


Accuracy is typically used as the standard measure for 
classification performance. However, for datasets with an 
unbalanced distribution such as the one used here, this measure 
can be misleading and not very informative about the errors 
being committed by the classifier. Instead of relying solely on 
accuracy, we also used the F1 score (or F-measure), which takes 
into consideration both precision and recall. To provide a point 
of reference for our CNN model results using our proposed input 
representations, we first report the performance achieved using 
baseline techniques for sentence classification. We report 
accuracy and F1 score over all data mining techniques and 
datasets in Tables 3 and 4. First, we used only the BoW and 
tf-idf representations as input representations for the algorithms. 
SVM was run with the RBF (Radial Basis Function) as kernel 
function. LSA initially works with the tf-idf representation, 
which is why some of its entries are marked n/a (not applicable). 
Table 3 shows the results. For a second experiment, we used the 
dictionary of synonyms and the CBR representation as input 
representation techniques for the MLP and the CNN. We kept 
only those 2 algorithms for this step because of their good 
results compared to the others in the first experiment. 
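
The gap between accuracy and F1 on unbalanced data can be seen in a small worked example. The sketch below uses macro-averaged F1 (an assumption, since the averaging scheme is not stated) and invented labels in which a classifier assigns every item to the majority level.

```python
def accuracy(y_true, y_pred):
    """Fraction of exact matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_label(y_true, y_pred, label):
    """Harmonic mean of precision and recall for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)


# Invented example: 8 items of level 1, 2 items of level 3,
# and a classifier that predicts level 1 for everything.
y_true = [1] * 8 + [3] * 2
y_pred = [1] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

In this example accuracy is 0.8 although the minority level is never recognized, whereas the macro F1 drops to about 0.44.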


Table 4 shows the results. Figure 1 graphically shows the 
performance of MLP and CNN with both the 
dictionary-of-synonyms and class-based techniques on the 2 
types of problems previously mentioned in Section 2 (5 and 9 
classes). 


For all the algorithms, we trained the models on 75% of data 
(which is about 500 verbatims) and we tested on the remaining 
verbatims (138 verbatims). 


5. DISCUSSION 


We begin our discussion by looking at the most basic 
representations, those involving the BoW and the tf-idf (Table 
3). We note that none of the 5 baseline algorithms was able to 
correctly classify at least half of the data with the BoW 
representation technique. Only NB and MLP were able to 
correctly classify more than 50% with tf-idf. However, the F1 
score remains relatively low in general. Furthermore, for SVM, 
LSA and LDA, all the verbatims in the test data were classified 
as level 1. For NB, they were classified into levels 1 and 3. The 
reason for these misclassifications can be seen in Table 1, where 
levels 1 and 3 are the most represented in the dataset. This 
brings us to the conclusion that those 2 representations depend 
strongly on the distribution of the data into classes. Despite the 
time-consuming learning, CNN and MLP gave the best results. 


Table 3. Accuracy and F1 scores of 6 baseline algorithms for sociomoral reasoning level classification. The input data are 
represented with BoW and Tf-idf techniques. 




Table 4. Accuracy and F1 score of the CNN model and MLP for sociomoral reasoning level classification. The input data are 
represented with class-based and dictionary of synonyms techniques. 


Input representation | MLP (5 classes) | CNN (5 classes) | MLP (9 classes) | CNN (9 classes) 
Dictionary of synonyms | 66.33 | 71.25 | 44.00 | 52.00 
Table 5. Results of the CNN model and MLP with error margins (1 for the 5 classes and 0.5 for the 9 classes). 


Input representation | MLP (5 levels) | CNN (5 levels) | MLP (9 levels) | CNN (9 levels) 
 | 84.28 | 92.00 | 63.52 | $4.00 
[Bar chart: MLP and CNN (5 and 9 classes) across BoW, TF-IDF, dictionaries of synonyms and class-based representation.] 


Figure 1. Variation of the accuracy of MLP and CNN, based on input data representation techniques. 


For Table 4, we ran our CNN model and MLP with the dictionary 
of synonyms and class-based techniques. We also considered 
the 5-level and 9-level problems. At first glance, we see that the 
CBR gives the best results compared to other representation 
techniques. The best result was obtained from our CNN model, 
which provided 85% accuracy and an F1 score of 83%, an 
acceptable result for the problem. The training of the CNN took 
more time than other techniques because we needed to find the 
parameters that achieved the best results. We limited the number 
of iterations to 250 to avoid overfitting. Beyond 250 iterations, 
we obtained poorer classification results on test data, but over 
99% on training data. We also note that the results vary 
considerably based on the training set, suggesting that selection 
of the training set is an important part of the pre-processing of 
data for the CNN model. 


Why does CNN give the best results? 


The size of filters (number of rows) in the CNN can be 
compared to the idea behind n-grams. The convolution is done 
on 2, 3 or 4 words at a time if the filters are of size 2, 3 or 4, 
respectively. So, the CNN takes into account the order of the 
words in sentences. Another reason for the better results with the 
CNN compared to other techniques is that it can extract deep 
features (e.g., semantically grounded) using a series of 
convolutions, filters, feature maps and pooling on data, which 
helps in the discrimination of data. The input data representation 
also contributes to this performance. 
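
The n-gram analogy can be illustrated by listing the word windows that a filter of a given height would cover as it slides over a sentence. This is only a sketch of the sliding-window idea (stride 1 and no padding are assumed); it does not compute any convolution, and the function name `word_windows` is hypothetical.

```python
def word_windows(tokens, height):
    """Consecutive word groups covered by a filter of the given height."""
    return [tokens[i:i + height] for i in range(len(tokens) - height + 1)]


tokens = ["elle", "n'apprendrait", "pas", "de", "ses", "erreurs"]
bigram_windows = word_windows(tokens, 2)
fourgram_windows = word_windows(tokens, 4)
```

Because each window keeps its words in their original order, features computed from it depend on word order, unlike a bag-of-words vector.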


Real-life sociomoral reasoning classification 


In manual scoring of socio-moral reasoning, different experts 
occasionally associate the same verbatim with different levels 
because of inherent variability between even expert raters. 
Taking into consideration that even experts can make errors, we 
retrained our model (on both 5-level and 9-level problems) by 
considering an error margin of 1 for the 5-level problem and of 
0.5 for the 9-level problem. For example, if the model predicts 
that the level of verbatim v1 is 1 and the real level is 1.5, 
then it is considered a correct classification. 
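
The margin-tolerant scoring rule can be written as a short function. This sketch covers only the evaluation rule (a prediction counts as correct when it lies within the margin of the expert label); the function name and the example labels are invented for illustration.

```python
def accuracy_within_margin(y_true, y_pred, margin):
    """Share of predictions within `margin` of the expert-assigned level."""
    hits = sum(abs(t - p) <= margin for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Invented labels on the 9-level scale (1, 1.5, ..., 5).
y_true = [1.5, 2.0, 3.5, 4.0]
y_pred = [1.0, 3.0, 3.5, 2.5]

strict = accuracy_within_margin(y_true, y_pred, margin=0.0)
tolerant = accuracy_within_margin(y_true, y_pred, margin=0.5)
```

With a margin of 0.5, the prediction of level 1 for a verbatim labeled 1.5 is counted as correct, as in the example above.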


Table 5 shows the results when error margins are considered. 
We can see that the CNN on the 5-level problem achieves 
exceptional results with an accuracy of 92%, which is the best so 
far. 


6. CONCLUSION 


We propose a model able to predict with over 90% accuracy the 
sociomoral reasoning skill level based on a textual verbatim. 
Specifically, we propose a simple but efficient input text data 
representation that can work with different classification 
algorithms. This work is a considerable contribution to sentence 
classification and to sociomoral reasoning maturity 
classification. Verbatims are typically manually annotated by 
experts. Our proposed model is intended to help them in this 
task and produces results that are comparable to the accuracy of 
independent raters, suggesting promising applications. 


Contrary to state-of-the-art techniques in text classification, the 
CNN model we propose achieves the best results in our context. 
This is mainly due to its deep structure that can learn useful 
features from data. Despite the good results obtained by the 
CNN, parameters must be manually tuned and require many 
experiments to find the best results. MLP can be treated as a 
lexical mining technique on text, because all neurons on hidden 
layers receive information from all previous neurons (blind 
mining). The order or the meaning of words is not considered. 
On the other hand, CNN can capture deep features from data and 
thus the order (pattern or syntax mining) and the meaning 
(semantic mining) of words, if the representation is good 
enough. Since a sentence is fully defined by its syntax, lexis and 
semantics, a model considering those features will lead to better 
results in sentence classification and even NLP tasks. In future 
work, we will develop a model based on a pooling of 
MLP and CNN techniques. We will also consider the use of the 
multiple-channel feature of CNN to combine different 
representations of sentences as reported by Kim [11] and Yin 
[22]. Similarly, while more complex data representations for text 
classification will undoubtedly continue to be developed, those 
deploying such technologies in real-life problems will likely be 
attracted to simpler variants, which afford fast training and 
prediction times, such as the CBR model that we propose. The 
only downside of our representation approach is that it requires a 
classified dataset. We will explore the combination of the 
class-based approach and other interesting representation 
techniques that use RBM (Restricted Boltzmann Machine) or 
autoencoders in future work, in order to achieve 90% accuracy 
without adjustment for error margins. The proposed coding 
solution will be implemented in the Les Dilemmes video game. 
The next step will be the assessment of the efficiency of the 
sociomoral reasoning dimension as a learner model facet in a 
highly adaptive social serious video game. 
ABSTRACT 


Existing personalized learning systems (PLSs) have primarily 
focused on providing learning analytics using data from 
learners. In this paper, we extend the capability of current 
PLSs by incorporating data from instructors. We propose a 
latent factor model that analyzes instructors’ preferences in 
explicitly excluding particular questions from learners’ 
assignments in a particular subject domain. We formulate the 
problem of predicting instructors’ question exclusion 
preferences as a matrix factorization problem, and incorporate 
expert-labeled Bloom’s Taxonomy tags on each question as 
a factor in our statistical model to improve model 
interpretability. Experimental results on a real-world educational 
dataset demonstrate that the proposed model achieves superior 
prediction performance compared to several other baseline 
methods commonly used in recommender systems. 
Additionally, by explicitly incorporating Bloom’s Taxonomy, 
the model provides meaningful interpretations that help 
understand why instructors exclude certain questions. Since 
instructor preference data contains their insights after years 
of teaching experience, our proposed model has the potential 
to further improve the question recommendations that 
PLSs make for learners. 


Keywords 
personalized learning, educational data mining, latent factor 
model, Bloom’s Taxonomy 


1. INTRODUCTION 


Today’s education system has largely remained a 
“one-size-fits-all” learning experience in which the instructor 
selects a single learning action for all learners, ignoring their 
diverse backgrounds, interests, and goals. Modern machine 
learning (ML) techniques have led to a great acceleration in the 
development of personalized learning systems (PLSs) that 
have the potential to revolutionize education by delivering a 
high-quality and affordable personalized learning experience 
at large scale. 


Current PLSs generally perform learning analytics using 
only learner data, overlooking data that instructors gener- 
ate. However, when instructors are present in educational 
settings such as traditional classrooms, they generate im- 
portant data that reveals how they prefer to interact with 
learning resources. Augmenting current learning analyt- 
ics approaches by modeling instructors’ preferences clearly 
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provide advantages, since their preferences reflect years of 
teaching experience and thus provide valuable insights on 
how to utilize learning resources effectively. As a result, 
PLSs can refine their learning resource recommendations 
for learners using both learner data and these valuable in- 
sights. Additionally, analysis of instructor preferences for 
learning resources can serve as a starting point for recommending
learning resources to learners when learner data is
scarce, such as at the beginning of a semester.


In this work, we focus on a specific instance of instructors’ 
content¹ preferences. We collect instructors’ preferences to
exclude questions from being given to learners in their class
via OpenStax Tutor [13], a personalized learning and teaching
platform. OpenStax Tutor has a functionality to automatically
select homework assignment questions for learners
from a question corpus. At the same time, it allows instruc- 
tors to exclude questions they do not want OpenStax Tutor 
to assign to learners in their classes from the corpus. While 
this exclusion option allows more flexibility for instructors 
to control homework assignment questions that learners re- 
ceive, manually selecting questions to exclude from a (po- 
tentially huge) corpus is a labor-intensive process. As a re- 
sult, analyzing instructors’ question exclusion behavior has 
immediate utility in automating the question exclusion pro- 
cess. 


1.1 Contributions 


With the objective of analyzing instructors’ preferences on 
assigning questions to learners on the OpenStax Tutor plat- 
form, we develop a novel latent factor model that predicts 
instructors’ question preferences in a particular subject do- 
main given previous records of whether instructors choose 
to exclude certain questions from homework assignments. 
The latent factor modeling approach is primarily inspired 
by SPARFA [10], which is a successful latent factor model
for learner and content analysis. But more importantly, this 
approach allows flexible incorporation of prior knowledge in 
the form of meta-data into the model. Consequently, the 
model that we develop in this work can be easily extended 
to include additional information in the form of latent fac- 
tors to explain instructors’ question exclusion preferences, 


¹From now on, we will use the phrase “learning resources”
and the word “content” interchangeably. 
as well as be used in other educational data mining tasks
where auxiliary information is available. Additionally, our
proposed model incorporates expert-labeled Bloom’s Taxonomy
tags for each question to explain instructors’ question
exclusion preferences, based on the conjecture that instructors
have varying inclinations towards different Bloom’s Taxonomy
tags².


Experimental results on a real-world educational dataset 
show that, compared to standard methods used in recom- 
mender systems, our model achieves higher overall accuracy 
in predicting instructors’ question preferences. Additionally, 
we demonstrate that our model is highly interpretable in 
that the Bloom’s Taxonomy explains question preferences of
individual instructors, and reveals question preference pat- 
terns among instructors. Our analysis of the instructors’ 
question exclusion preferences enables PLSs to incorporate 
instructors’ insights on questions and potentially improve 
the quality of their personalized question recommendations. 


We emphasize that our proposed model is not limited to 
analyzing instructors’ question exclusion preferences; it can 
be easily modified to analyze instructors’ preferences on a 
broader range of learning resources. Therefore, our work 
serves as an initial investigation into extending the capabil- 
ity of existing PLSs with the analysis of instructor learning 
resource interaction data. 


1.2 Related Work 


We formulate the problem of predicting instructors’ question 
preferences as a matrix factorization problem underlying a 
recommender system. Recommender systems often rely on 
collaborative filtering (CF); the two most successful families
of CF approaches to date are neighborhood-based methods 
and latent factor methods [4]. Neighborhood-based methods 
predict preferences based on neighbors chosen by some simi- 
larity measure. Latent factor methods, in particular, can be 
readily applied to education applications, resulting in ten- 
sor factorization for student modeling [15] and probabilistic 
models such as SPARFA [10], a primary source of inspiration 
for this work. However, these approaches, in their original 
form, do not have mechanisms to incorporate meta-data on 
learners and questions. Therefore, the explanatory power of 
these methods is usually limited. Our proposed model, on 
the other hand, extends the original latent factor model to 
explicitly include the Bloom’s Taxonomy tag of each question
as meta-data, providing additional interpretability while,
at the same time, improving prediction accuracy.


Works including [6] and [12] incorporate external factors 
such as movie genres to improve users’ movie rating pre- 
diction in the Netflix challenge [2], but their methods do 
not directly apply to education scenarios. 


The work in [14] broadly describes a Bayesian approach to 
model instructors. While our work pursues a similar ob- 
jective, we propose a concrete model with evaluations on 


²Bloom’s Taxonomy hierarchically describes questions in
terms of one of six cognitive processes: remembering,
understanding, applying, analyzing, evaluating, and creating,
in order of increasing cognitive complexity [9]. It describes
the cognitive processes by which learners encounter
and work with knowledge [1].


a real-world dataset instead of a high-level overview. [11] 
uses the k-means clustering algorithm to recommend learn- 
ing resources for instructors based on similar teaching styles 
among instructors. In addition to studying question type 
preferences, we approach the problem with a latent factor 
model instead of k-means clustering, yielding results that 
are more interpretable. 


The work in [16] compares several models for predicting learners’
next-term grades using various features, including instructors’
job title, rank, and tenure status. Our work, in contrast,
uses data that captures instructors’ direct interaction
with learning resources rather than simple demographic
information.


2. LATENT FACTOR MODEL 

Let $N$, $Q$, and $K$ denote the total number of instructors, the
total number of questions, and the total number of distinct
Bloom’s Taxonomy tags, respectively. Let $Y$ be the binary-valued
matrix of dimension $N$ by $Q$ that represents instructors’
preferences for a particular course, where $Y_{ij} = 1$ indicates
that instructor $i$ explicitly expresses a preference to exclude
question $j$, and $Y_{ij} = 0$ indicates no preference. Also let
$\mathbf{a}_j$ be a vector of dimension $K$ that represents the
question-Bloom’s Taxonomy tag association for question $j$, where
$a_{jk}$ denotes the $k$th component of $\mathbf{a}_j$: $a_{jk} = 1$
indicates an association of question $j$ with Bloom’s Taxonomy tag $k$,
and $a_{jk} = 0$ indicates no association.


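With the above setup, we model the entries of $Y$ as Bernoulli random
variables:

$$Y_{ij} \sim \mathrm{Ber}\big(\phi(\mathbf{p}_i^T \mathbf{a}_j + \mathbf{g}_i^T \mathbf{h}_j)\big), \qquad (1)$$

where the function $\phi(\cdot)$ is the sigmoid function:

$$\phi(x) = \frac{1}{1 + e^{-x}}.$$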

In the model, $\mathbf{p}_i \in \mathbb{R}^K$, $\mathbf{g}_i \in \mathbb{R}^M$,
and $\mathbf{h}_j \in \mathbb{R}^M$ are model parameters to be estimated,
where $M$ is the dimension of $\mathbf{g}_i$ and $\mathbf{h}_j$ (we select
the value of $M$ via cross validation). Intuitively, the latent factor
$\mathbf{p}_i$ represents the instructor Bloom’s Taxonomy tag preference
vector that reveals instructors’ different preferences on each Bloom’s
Taxonomy tag. The latent factors $\mathbf{g}_i$ and $\mathbf{h}_j$ model
additional factors that also contribute to explaining the observed data
matrix $Y$.


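To compare the significance of the factor $\mathbf{p}_i$ against the
factors $\mathbf{g}_i$ and $\mathbf{h}_j$, we use two simplified variants
of the full model in Equation 1, namely the P Model, which involves only
the factor $\mathbf{p}_i$, and the GH Model, which involves only the
factors $\mathbf{g}_i$ and $\mathbf{h}_j$:

$$\text{P Model:} \quad Y_{ij} \sim \mathrm{Ber}\big(\phi(\mathbf{p}_i^T \mathbf{a}_j)\big) \qquad (2)$$

$$\text{GH Model:} \quad Y_{ij} \sim \mathrm{Ber}\big(\phi(\mathbf{g}_i^T \mathbf{h}_j)\big) \qquad (3)$$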
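The three variants differ only in which inner products enter the sigmoid. The following is a minimal Python sketch of Equations 1 to 3, not the authors' implementation; the function names (`sigmoid`, `dot`, `exclusion_probability`) and all example vectors are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    """Logistic function phi(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def exclusion_probability(a_j, p_i=None, g_i=None, h_j=None):
    """Probability that instructor i excludes question j.

    Passing p_i, g_i, and h_j gives the full model (Equation 1);
    passing only p_i gives the P Model (Equation 2); passing only
    g_i and h_j gives the GH Model (Equation 3).
    """
    score = 0.0
    if p_i is not None:
        score += dot(p_i, a_j)
    if g_i is not None and h_j is not None:
        score += dot(g_i, h_j)
    return sigmoid(score)


# Hypothetical example: K = 6 tags, question tagged k = 2, M = 2.
a_j = [0, 1, 0, 0, 0, 0]
p_i = [0.1, 1.5, -0.3, 0.0, 0.2, -1.0]
g_i = [0.5, -0.2]
h_j = [1.0, 1.0]

full = exclusion_probability(a_j, p_i, g_i, h_j)         # sigmoid(1.5 + 0.3)
p_only = exclusion_probability(a_j, p_i=p_i)             # sigmoid(1.5)
gh_only = exclusion_probability(a_j, g_i=g_i, h_j=h_j)   # sigmoid(0.3)
```

Because each question carries a single tag, the inner product with `a_j` simply selects the entry of `p_i` for that tag, which is what makes the entries of the preference vector directly readable.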


2.1 Optimization Algorithm 

We formulate the maximum-likelihood parameter estimation 
problem for the proposed model as an optimization problem. 
The optimization objective is given by 


minimize f(P,G,H), 


where $P = [\mathbf{p}_1, \ldots, \mathbf{p}_N]$ denotes the matrix of instructor
Bloom’s Taxonomy tag preference associations, formed by stacking
the association vectors together. $G$ and $H$ are defined analogously.
The cost function $f(P, G, H)$ is given by


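$$f(P, G, H) = \sum_{i=1}^{N} \sum_{j=1}^{Q} \log\Big(1 + \exp\big(-(\mathbf{p}_i^T \mathbf{a}_j + \mathbf{g}_i^T \mathbf{h}_j)\big)\Big) + \frac{\lambda}{2} \sum_{i=1}^{N} \|\mathbf{p}_i\|_2^2 + \frac{\gamma}{2} \sum_{i=1}^{N} \|\mathbf{g}_i\|_2^2 + \frac{\eta}{2} \sum_{j=1}^{Q} \|\mathbf{h}_j\|_2^2.$$

The last three terms in the cost function are regularization
terms added to prevent overfitting. $\lambda$, $\gamma$, and $\eta$ are
regularization parameters for the factors $\mathbf{p}_i$, $\mathbf{g}_i$,
and $\mathbf{h}_j$, respectively.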


The above optimization problem is non-convex, but the sub- 
problems to optimize over each parameter while holding the 
others fixed are convex. We therefore employ block coordinate
descent to efficiently find a local minimum of the above
optimization problem by iteratively updating each param- 
eter in turn. The update equations for the parameters are 
given by 


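$$\mathbf{p}_i^{\text{new}} = \mathbf{p}_i^{\text{old}} - \delta \frac{\partial}{\partial \mathbf{p}_i} f(\mathbf{p}_i^{\text{old}}, \mathbf{g}_i^{\text{old}}, \mathbf{h}_j^{\text{old}}),$$

$$\mathbf{g}_i^{\text{new}} = \mathbf{g}_i^{\text{old}} - \delta \frac{\partial}{\partial \mathbf{g}_i} f(\mathbf{p}_i^{\text{new}}, \mathbf{g}_i^{\text{old}}, \mathbf{h}_j^{\text{old}}),$$

$$\mathbf{h}_j^{\text{new}} = \mathbf{h}_j^{\text{old}} - \delta \frac{\partial}{\partial \mathbf{h}_j} f(\mathbf{p}_i^{\text{new}}, \mathbf{g}_i^{\text{new}}, \mathbf{h}_j^{\text{old}}),$$

where $\delta$ is the step size. The gradients of the cost function
with respect to each parameter are given by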


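$$\frac{\partial}{\partial \mathbf{p}_i} f(\mathbf{p}_i, \mathbf{g}_i, \mathbf{h}_j) = -\sum_{j=1}^{Q} \frac{\mathbf{a}_j}{1 + \exp(\mathbf{p}_i^T \mathbf{a}_j + \mathbf{g}_i^T \mathbf{h}_j)} + \lambda \mathbf{p}_i,$$

$$\frac{\partial}{\partial \mathbf{g}_i} f(\mathbf{p}_i, \mathbf{g}_i, \mathbf{h}_j) = -\sum_{j=1}^{Q} \frac{\mathbf{h}_j}{1 + \exp(\mathbf{p}_i^T \mathbf{a}_j + \mathbf{g}_i^T \mathbf{h}_j)} + \gamma \mathbf{g}_i,$$

$$\frac{\partial}{\partial \mathbf{h}_j} f(\mathbf{p}_i, \mathbf{g}_i, \mathbf{h}_j) = -\sum_{i=1}^{N} \frac{\mathbf{g}_i}{1 + \exp(\mathbf{p}_i^T \mathbf{a}_j + \mathbf{g}_i^T \mathbf{h}_j)} + \eta \mathbf{h}_j.$$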
At the beginning of optimization, we randomly initialize the
model parameters $\mathbf{p}_i$, $\mathbf{g}_i$, $\mathbf{h}_j$ for all $i, j$.
In each optimization iteration, we first loop over all $i$’s to update
all $\mathbf{p}_i$ and $\mathbf{g}_i$ while holding all $\mathbf{h}_j$’s fixed,
and then loop over all $j$’s to update $\mathbf{h}_j$ using the newly
calculated $\mathbf{p}_i$’s and $\mathbf{g}_i$’s. We repeat the above
iterations until convergence, i.e., until the difference in the cost
function between two consecutive iterations falls below
a predefined threshold.
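The update schedule described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: it assumes the standard regularized Bernoulli negative log-likelihood implied by Equation 1 (which includes the observed labels and may differ in form from the cost as printed), a fixed step size, and Gaussian initialization. The names `fit`, `cost`, and `step`, the default hyperparameters, and the toy data are all hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cost(Y, A, P, G, H, lam, gam, eta):
    """Regularized Bernoulli negative log-likelihood (assumed form)."""
    total = 0.0
    for i, y_row in enumerate(Y):
        for j, y in enumerate(y_row):
            z = dot(P[i], A[j]) + dot(G[i], H[j])
            # log(1 + exp(z)) - y * z, computed stably
            total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z
    total += 0.5 * lam * sum(dot(p, p) for p in P)
    total += 0.5 * gam * sum(dot(g, g) for g in G)
    total += 0.5 * eta * sum(dot(h, h) for h in H)
    return total


def step(vec, grad, delta):
    """One gradient step: vec - delta * grad."""
    return [v - delta * g for v, g in zip(vec, grad)]


def fit(Y, A, M=2, lam=0.1, gam=0.1, eta=0.1, delta=0.05,
        tol=1e-6, max_iter=500, seed=0):
    rng = random.Random(seed)
    N, Q, K = len(Y), len(Y[0]), len(A[0])

    def init(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)]
                for _ in range(rows)]

    P, G, H = init(N, K), init(N, M), init(Q, M)

    def residuals_for_instructor(i):
        # Predicted probability minus observed label, for every question.
        return [sigmoid(dot(P[i], A[j]) + dot(G[i], H[j])) - Y[i][j]
                for j in range(Q)]

    history = [cost(Y, A, P, G, H, lam, gam, eta)]
    for _ in range(max_iter):
        for i in range(N):  # update p_i, then g_i, holding all h_j fixed
            r = residuals_for_instructor(i)
            grad_p = [sum(r[j] * A[j][k] for j in range(Q)) + lam * P[i][k]
                      for k in range(K)]
            P[i] = step(P[i], grad_p, delta)
            r = residuals_for_instructor(i)  # recomputed with the new p_i
            grad_g = [sum(r[j] * H[j][m] for j in range(Q)) + gam * G[i][m]
                      for m in range(M)]
            G[i] = step(G[i], grad_g, delta)
        for j in range(Q):  # update h_j using the new p_i and g_i
            r = [sigmoid(dot(P[i], A[j]) + dot(G[i], H[j])) - Y[i][j]
                 for i in range(N)]
            grad_h = [sum(r[i] * G[i][m] for i in range(N)) + eta * H[j][m]
                      for m in range(M)]
            H[j] = step(H[j], grad_h, delta)
        history.append(cost(Y, A, P, G, H, lam, gam, eta))
        if abs(history[-2] - history[-1]) < tol:
            break
    return P, G, H, history


# Hypothetical toy data: 4 instructors, 6 questions, K = 2 tags.
A = [[1, 0]] * 3 + [[0, 1]] * 3
Y = [[1, 1, 1, 0, 0, 0]] * 2 + [[0, 0, 0, 1, 1, 1]] * 2
P, G, H, history = fit(Y, A)
```

In this toy setup the first two instructors exclude only questions with the first tag, so the first entry of their fitted preference vectors ends up larger than the second, and the reverse holds for the last two instructors.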


Note that the inference problem for the P Model in Equa- 
tion 2 is convex, and optimization is straightforward via gra- 
dient descent. Since the GH Model in Equation 3 involves 
two sets of parameters and has a non-convex inference prob- 
lem, we employ the same block coordinate descent method 
as in the full model. 


2.2 Model Extensions 


We now enumerate possible extensions to the proposed model. 


First, we can incorporate additional prior information as latent
factors in the model by simply including other modalities
of meta-data as additional inner product terms of
two more latent factors inside the $\phi(\cdot)$ function. In this way,
in each inner product term, one factor denotes the newly

Table 1: Performance comparison between the proposed model and its
variants, in terms of prediction accuracy (ACC) and area under the
receiver operating characteristic curve (AUC). The proposed model
achieves the best result compared with its two variants. The model
involving the $\mathbf{g}_i$ and $\mathbf{h}_j$ factors achieves better
performance than the model with the $\mathbf{p}_i$ factor alone.

Models            ACC               AUC
Proposed Model    0.9033 ± 0.0045   0.9592 ± 0.0061
P Model           0.8880 ± 0.0047   0.8908 ± 0.0064
GH Model          0.9026 ± 0.0048   0.9254 ± 0.0058


included meta-data modality, and the other characterizes 
the instructor’s exclusion preference in terms of that spe- 
cific modality of meta-data. Concretely, the extension of 
the model in Equation 1 has the following form: 


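$$Y_{ij} \sim \mathrm{Ber}\Big(\phi\Big(\sum_{l=1}^{L} (\mathbf{u}_i^l)^T \mathbf{v}_j^l + \mathbf{g}_i^T \mathbf{h}_j\Big)\Big), \qquad (4)$$

where we have replaced the inner product term $\mathbf{p}_i^T \mathbf{a}_j$ in
Equation 1 with a sum of $L$ inner product terms. Each pair
$\mathbf{u}_i^l$ and $\mathbf{v}_j^l$ models the instructor and question
association with a particular modality of meta-data. Additionally, the
dimensions of $\mathbf{u}_i^l$ and $\mathbf{v}_j^l$ can vary for different
$l$’s depending on the mathematical representation of that meta-data modality.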
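As a rough illustration of Equation 4, the sketch below sums one inner product per meta-data modality before applying the sigmoid. The function name `extended_probability`, the choice of a second "textbook chapter" modality, and every numeric value are assumptions made for this example only.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def extended_probability(u_i, v_j, g_i, h_j):
    """Exclusion probability with L meta-data modalities.

    u_i and v_j are lists of L vectors; u_i[l] and v_j[l] must have the
    same dimension, but that dimension may differ from one l to another.
    """
    score = sum(dot(u, v) for u, v in zip(u_i, v_j)) + dot(g_i, h_j)
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical L = 2: a Bloom's Taxonomy tag (dimension 6) and a
# textbook chapter indicator (dimension 3).
u_i = [[0.0, 1.0, 0.0, 0.0, 0.0, 0.0], [0.5, -0.5, 0.0]]
v_j = [[0, 1, 0, 0, 0, 0], [1, 0, 0]]
prob = extended_probability(u_i, v_j, g_i=[0.0, 0.0], h_j=[1.0, 1.0])
# score = 1.0 (tag term) + 0.5 (chapter term) + 0.0 (g, h term) = 1.5
```

With `L = 1`, `u_i = [p_i]`, and `v_j = [a_j]`, this reduces to the full model in Equation 1.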


Next, it is easy to see that the same approach can be applied
to analyzing instructors’ preferences on other learning
resources. Although we specify in Equation 1 that $Y_{ij}$
represents instructor $i$’s preference for question $j$, $Y_{ij}$ can
naturally represent preferences for other content types, by
using $j$ to index learning resources. Therefore, we can easily
extend the proposed model in Equation 1 to analyze additional
instructor preference data with a different preference
data matrix $Y$.


3. EXPERIMENTS 


We now evaluate the prediction performance of the proposed 
latent factor model using a real-world educational dataset. 
We further showcase the interpretability of the model by 
visualizing the instructor Bloom’s Taxonomy tag preference 
vectors $\mathbf{p}_i$.


3.1 Dataset 


We collect from OpenStax Tutor [13] 20 instructors’ pref- 
erences on all 896 questions of the textbook “Concepts of 
Biology” that these instructors use in their classes, resulting 
in a fully observed data matrix Y of dimension 20 by 896. 
About 15% of all entries in Y have a value of 1, meaning 
that an instructor explicitly indicates to exclude a question, 
and the rest 0, meaning that there is no such indication. 
We remind the reader that excluding a question in Open- 
Stax Tutor means that this question is excluded from the 
pool of questions that OpenStax Tutor selects from to as- 
sign to learners as personalized practice recommendations. 
We also collect the Bloom’s Taxonomy tag for each question,
labeled by domain experts, as meta-data on the questions.
Since there are 6 distinct Bloom’s Taxonomy tags
Table 2: Performance comparison between the proposed model and existing
collaborative filtering methods in terms of the four metrics. The
proposed model shows superior prediction performance compared to the
other methods on all metrics.

Metric      Full Model        UBCF              IBCF              FSVD
ACC         0.9033 ± 0.0045   0.8961 ± 0.0048   0.8895 ± 0.0048   0.8896 ± 0.0045
F-1         0.6483 ± 0.0128   0.6007 ± 0.0158   0.5696 ± 0.0137   0.6185 ± 0.0158
Precision   0.7163 ± 0.0222   0.7070 ± 0.0214   0.6928 ± 0.0254   0.6964 ± 0.0236
Recall      0.6153 ± 0.0227   0.5226 ± 0.0190   0.4954 ± 0.0159   0.5661 ± 0.0248


Table 3: Comparison between $p_{ik}$ (second row for each instructor) and
the percentage of questions they actually excluded under each Bloom’s
Taxonomy tag $k$ (first row for each instructor), for selected instructors.
The values of $\mathbf{p}_i$ estimated by the proposed model closely
resemble the actual percentage of questions each instructor excluded.

                            Bloom’s Taxonomy tag
Instructor   Row          k=1     k=2     k=3     k=4     k=5     k=6
i = 3        % excluded   0.9%    1.6%    0.5%    1.8%    0.0%    0.0%
             p_ik         0.058   0.083   0.038   0.216   0.075   0.084
i = 5        % excluded   16.9%   16.3%   19.0%   5.5%    21.1%   33.3%
             p_ik         0.441   0.448   0.501   0.360   1.000   0.858
i = 9        % excluded   63.1%   67.8%   72.4%   67.3%   42.1%   33.3%
             p_ik         0.826   1.000   0.985   0.924   0.583   0.215


in total, the dimension of the question-Bloom’s Taxonomy
tag association vector $\mathbf{a}_j$ is $K = 6$. The entries of
$\mathbf{a}_j$ correspond to Bloom’s Taxonomy tags in increasing levels of
cognitive complexity, i.e., $k = 1$ represents “remembering”,
$k = 2$ represents “understanding”, etc. Additionally, each
question is associated with only one Bloom’s Taxonomy tag in
our dataset. Therefore, the values of $\mathbf{a}_j$ satisfy
$a_{jk} \in \{0, 1\}$ and $\sum_k a_{jk} = 1$ for all $j$.
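The one-hot structure of the tag association vector can be made concrete with a short sketch. The names `BLOOM_TAGS` and `bloom_one_hot` are hypothetical helpers introduced here; the tag order follows the increasing cognitive complexity stated in the text.

```python
# Bloom's Taxonomy tags in increasing cognitive complexity (k = 1, ..., 6).
BLOOM_TAGS = [
    "remembering",
    "understanding",
    "applying",
    "analyzing",
    "evaluating",
    "creating",
]


def bloom_one_hot(tag):
    """Return the K = 6 one-hot vector a_j for a single-tag question."""
    if tag not in BLOOM_TAGS:
        raise ValueError("unknown Bloom's Taxonomy tag: %r" % (tag,))
    return [1 if t == tag else 0 for t in BLOOM_TAGS]


a_j = bloom_one_hot("applying")  # the third tag, k = 3
```

Each vector built this way has entries in {0, 1} that sum to 1, matching the constraint above.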


3.2 Experimental Setup 

We compare our model and its variants against three
methods frequently used in recommender systems: user-based
collaborative filtering (UBCF), item-based collaborative
filtering (IBCF), and Funk singular value decomposition
(FSVD). UBCF and IBCF use similarities among users (instructors)
and items (questions), respectively, and predict
a user’s preference on an item based on the preferences of
the most similar users or items. FSVD makes the observation
that the actual number of user and item types is much lower
than the number of users and items, and therefore uses
a low-rank model of user-item interactions [4, 5]. The authors of [7]
explain the detailed implementations and evaluation methods
for UBCF, IBCF, and FSVD that we use in this paper.


We use a total of five metrics for model evaluation: (i) prediction
accuracy (ACC), (ii) precision, (iii) recall, (iv) F-1
score, and (v) area under the receiver operating characteristic
curve (AUC) of the resulting binary classifier [8]. Formulas
for calculating metrics (i) through (iv) are shown below:

$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad \text{precision} = \frac{TP}{TP + FP},$$

$$\text{recall} = \frac{TP}{TP + FN}, \qquad \text{F-1} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}},$$


where TP denotes true positive, TN denotes true negative,
FP denotes false positive, and FN denotes false negative. In
the context of this paper, we treat preference for excluding
a question, corresponding to $Y_{ij} = 1$, as the positive class.
True positive means predicting the positive class when the
ground truth is also positive. False positive means predicting
the positive class when the ground truth is negative, and
the rest follow analogously. All metrics take on values in [0,1], with
larger values indicating better prediction performance. We
perform two sets of comparisons, one between the full model
and its two variants (the P and GH models) evaluated on
the ACC and AUC metrics, and the other between the
full model and UBCF, IBCF, and FSVD using ACC, F-1,
precision, and recall. Since the AUC metric is only appropriate
for evaluating algorithms using probabilistic models,
we do not compute it for the three CF methods, which do not have
an underlying probabilistic model.
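Metrics (i) through (iv) follow directly from the four confusion counts. A small sketch, assuming binary labels with 1 as the positive (exclusion) class; the function names and the convention of returning 0.0 when a denominator is zero are choices made for this example, not something the paper specifies.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Compute ACC, precision, recall, and F-1 from predicted labels."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"ACC": acc, "precision": precision, "recall": recall, "F-1": f1}


# Hypothetical labels: TP = 2, FN = 1, TN = 1, FP = 1.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```

Because only about 15% of the entries are positive, accuracy alone can look high for a classifier that rarely predicts exclusion, which is why precision, recall, and F-1 are reported alongside it.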


We perform 5-fold cross validation for model selection, i.e. 
choosing the best set of parameters for each model, and 
model assessment, i.e. evaluating the best model on the test 
set, according to the train-validation-test split paradigm. 
First, we randomly select 20% of all observed data and set it 
aside as test set. We then randomly partition the remaining 
80% of all data into four roughly equal-sized parts, fit the 
Figure 1: 2D projection of instructor Bloom’s Taxonomy
tag preference vectors using multidimensional
scaling and clustering using k-means, which
shows instructors’ diverse question exclusion preferences.
Notice that instructors 3, 5, and 9, whom
we show to have very different question exclusion
preferences, also appear far apart in the plot.


model to the first three of the four parts, and validate the fitted
model using the fourth part of the data to select the values 
of the regularization parameters using grid-search. Finally, 
we select the best performing model, fit it on all data except 
for the test set, and evaluate its performance on the test set. 
We perform 20 random partitions of the data, average the 
evaluation results, and compare the best evaluation results 
of each method. 
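One way to realize the split described above is sketched below. The text does not state how entries are sampled, so this sketch assumes an entry-wise random split of the (instructor, question) index pairs; the function name `split_entries`, its parameters, and the seed are hypothetical.

```python
import random


def split_entries(n_rows, n_cols, test_fraction=0.2, n_parts=4, seed=0):
    """Randomly split matrix entries into a test set and n_parts folds."""
    entries = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    rng = random.Random(seed)
    rng.shuffle(entries)
    n_test = round(test_fraction * len(entries))
    test, rest = entries[:n_test], entries[n_test:]
    # Deal the remaining entries into roughly equal-sized parts.
    parts = [rest[k::n_parts] for k in range(n_parts)]
    return test, parts


# Dimensions of the data matrix Y in this paper: 20 by 896.
test, parts = split_entries(20, 896)
train = [entry for part in parts[:3] for entry in part]  # fit on three parts
validation = parts[3]                                    # tune on the fourth
```

Repeating the call with different seeds would give the 20 random partitions over which results are averaged.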


3.3 Results and Discussions

Table 2 shows results for the full model, UBCF, IBCF, and
FSVD evaluated on the ACC, F-1, Precision, and Recall
metrics. The lower Recall score of the full model relative
to its ACC suggests that the proposed model still
exhibits some tendency, albeit less than the other methods,
to avoid assigning an exclusion preference label. Nevertheless,
comparing across columns, we see that the performance of
the full model, regardless of the choice of metric, is significantly
better than that of the other methods, showing promise
for the proposed latent factor model in predicting instructors’
question exclusion preferences.


Table 1 shows prediction performance results for the full 
model and its two variants evaluated on the ACC and AUC 
metrics. From the table, we observe that the full model 
achieves the best performance on both metrics. Further inspection
of the results of the two variants reveals that the
GH Model, which involves factors $\mathbf{g}_i$ and $\mathbf{h}_j$, achieves better
results on both metrics than the P Model, which involves
only factor $\mathbf{p}_i$. This implies that, besides Bloom’s Taxonomy,
additional factors are needed in the latent factor model
to better characterize instructors’ question exclusion preferences.
Even though Bloom’s Taxonomy contributes only
moderately to the prediction performance, the purpose of 
explicitly incorporating Bloom’s Taxonomy, as stated ear- 
Figure 2: Heatmap visualization of the cluster centers
that shows the radically different question exclusion
preferences of each cluster of instructors.


lier, is the power of interpretability it brings to the proposed 
model, which we demonstrate below. 


First, we use the instructor Bloom’s Taxonomy tag preference
vectors to interpret how instructors prefer to exclude
certain questions in terms of Bloom’s Taxonomy. Table 3
presents a comparison between the numerical values of entries
in the instructor Bloom’s Taxonomy tag preference vector
$\mathbf{p}_i$ and the percentage of questions that the corresponding
instructor excludes with each Bloom’s Taxonomy tag,
for a selected subset of instructors $i \in \{3, 5, 9\}$. Comparing
the values in the two rows for each instructor $i$ in the table,
we observe that higher values of $p_{ik}$ correspond to a higher
percentage of the questions with Bloom’s Taxonomy tag $k$ that
the instructor excludes. Therefore, $p_{ik}$ reflects the degree to
which instructor $i$ prefers to exclude questions with Bloom’s
Taxonomy tag $k$. For example, we observe from the second
row of instructor 5 that values of $p_{ik}$ are high for $k = 5$
and $k = 6$, indicating that this instructor strongly prefers
to exclude questions that involve more complex cognitive
processes such as evaluating and creating.

Second, the instructor Bloom’s Taxonomy tag preference vectors uncover
differences and patterns in instructors’ Bloom’s Taxonomy
tag preferences. Comparing the second row of all instructors
in Table 3, we see distinct preferences for different instructors.
For example, values of $p_{ik}$ for instructor 9 are high for
$k = 1, 2, 3, 4$, indicating that this instructor strongly prefers
not to assign questions that involve simpler cognitive processes
such as remembering, understanding, applying, and
analyzing. Such preferences are opposite to those of instructor 5.
Moreover, instructor 3 exhibits no obvious exclusion
preference for any Bloom’s Taxonomy tag, as indicated by
the small values of $p_{ik}$ for $i = 3$, setting this instructor apart
from both instructors 5 and 9.


We further visualize patterns in instructors’ question preferences
after projecting each $\mathbf{p}_i$ onto a 2-dimensional plane using
multidimensional scaling [3]. We then run the k-means
algorithm to group the instructors into 3 clusters. Figure 1
plots each $\mathbf{p}_i$ as a point in the 2-dimensional space, where
the color of the point denotes the cluster that the point
belongs to. The figure shows clear clustering patterns,
which suggests that instructors exhibit only a few patterns in
their Bloom’s Taxonomy tag preferences. Note that instructors
3, 5, and 9 are far apart in the figure and belong to different
clusters. Figure 2 presents a heatmap visualization of the
cluster centers that shows distinct Bloom’s Taxonomy preferences
across the three instructor clusters. For example, the
first and third clusters demonstrate almost entirely opposite
Bloom’s Taxonomy preferences: the first cluster tends
to exclude questions with more complex cognitive processes,
whereas the third cluster tends to exclude questions with
simpler cognitive processes. On the other hand, the second
cluster does not exhibit strong exclusion preferences for any
particular Bloom’s Taxonomy tag. Such clustering could
help a PLS recommend to an instructor questions that
they might want to exclude, based on instructors who have
demonstrated similar Bloom’s Taxonomy preferences.
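The grouping step can be illustrated with a plain-Python version of Lloyd's k-means algorithm. This sketch is a simplification: it clusters the preference vectors directly and omits the multidimensional scaling projection used for plotting. The preference vectors, the choice of initial centers, and the function names are made up for the example and are not the paper's data or code.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, centers, n_iter=100):
    """Lloyd's algorithm starting from the given initial centers."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        new_labels = [
            min(range(len(centers)),
                key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical preference vectors (K = 6) for six instructors.
points = [
    [0.1, 0.1, 0.0, 0.2, 0.1, 0.1],  # no strong exclusion preference
    [0.0, 0.1, 0.1, 0.1, 0.0, 0.1],
    [0.4, 0.4, 0.5, 0.4, 1.0, 0.9],  # excludes complex questions
    [0.3, 0.4, 0.4, 0.3, 0.9, 1.0],
    [0.8, 1.0, 1.0, 0.9, 0.6, 0.2],  # excludes simpler questions
    [0.9, 0.9, 1.0, 1.0, 0.5, 0.3],
]
labels, centers = kmeans(points, centers=[points[0], points[2], points[4]])
```

The resulting `centers` play the role of the cluster centers visualized in the heatmap: each one summarizes the tag-level exclusion tendencies of a group of instructors.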


4. CONCLUSIONS AND FUTURE WORK 


We have presented a latent factor model that predicts instructors’
question preferences and explicitly incorporates
questions’ Bloom’s Taxonomy tags to improve model interpretability.
Evaluated on a real-world educational dataset,
our proposed model shows superior prediction performance
over popular collaborative filtering methods frequently used
in recommender systems. Additionally, we demonstrated
model interpretability by showing that the Bloom’s Taxonomy
captures each instructor’s question preferences reasonably
well, and we visualized different Bloom’s Taxonomy
preference patterns across instructors. These encouraging
results show the promise of using a latent factor approach for
modeling instructors’ content preferences to 1) potentially
automate the question exclusion process in OpenStax Tutor,
and 2) more broadly, improve various aspects of personalized
learning systems, such as intelligent content recommendation
that takes instructors’ preferences into account.


To achieve these goals, the following avenues of future research
seem appropriate. First, we used only one source
of meta-data, i.e., Bloom’s Taxonomy tags, in the proposed
model. We have shown that the proposed model is easily
extendable to accommodate additional meta-data; moreover,
the performance comparison between the P Model and the
GH Model shows the need to incorporate additional factors.
Therefore, we plan to extend the proposed model to include
other sources of meta-data, such as the textbook chapter
or section that each question belongs to, to improve both
prediction accuracy and model interpretability. Second, we
focused on instructors’ preferences in a very specific context,
i.e., question exclusion. We are interested to see how well the
proposed modeling approach can be adapted to analyze instructors’
preferences for other learning resources. Third, we
also plan to expand our experiments from a single textbook
to multiple textbooks and domains, in order to validate the
proposed approach for analyzing instructor preferences on a
wide range of content and across different subject domains.
ABSTRACT


Augmented Graph Grammars are a graph-based rule formalism
that supports rich relational structures. They can be
used to represent complex social networks, chemical structures,
and student-produced argument diagrams for automated
analysis or grading. In prior work we have shown
that Evolutionary Computation (EC) can be applied to induce
empirically-valid grammars for student-produced argument
diagrams based upon fitness selection. However, this
research has shown that while the traditional EC algorithm
does converge to an optimal fitness, premature convergence
can lead to it getting stuck in local maxima, which may
leave some rules undiscovered. In this work, we augmented the
standard EC algorithm to induce more heterogeneous Augmented
Graph Grammars by replacing the fitness selection
with a novelty-based selection mechanism every ten generations.
Our results show that this novelty selection increases
the diversity of the population and produces better and
more heterogeneous grammars.


Keywords 

Heterogeneous Rules, Augmented Graph Grammars, Argu- 
ment Diagrams, Evolutionary Computation, Novelty selec- 
tion 


1. INTRODUCTION 


Intelligent tutoring systems, social-networking systems, and 
computer-supported collaborative platforms have grown in- 
creasingly prevalent in education (e.g. Pyrenees [15], LASAD 
[8], and CSCL [13]). Consequently, researchers have be- 
gun to collect large repositories of complex relational data 
representing student-produced conceptual or structural dia- 
grams [8], structured user-system interaction logs [15], and 
personal relationships [13]. Researchers have generally an- 
alyzed this data via standard network analysis tools and 
gestalt relationships which allow us to assess general topo- 
logical graph structures but which do not focus on individual 
graph features or graph rules (e.g. [15, 13]). 
One of the primary goals of Graph-based Educational Data
Mining is to automatically identify substructures that can 
reveal vital pedagogical information in graph data. These 
features include good sub-solutions and structural flaws in 
students’ solutions, which can be used for automated guid- 
ance and grading [10]. Prior research has demonstrated that 
we can use hand-authored graph rules to evaluate student- 
produced argument diagrams [10]. But, hand-authored rules 
are expensive and time consuming to generate and do not 
always generalize well to novel contexts. Existing general 
purpose graph rule induction algorithms (e.g. [16, 2]) have 
limitations and are unsuited to the induction of generalized 
rules that use negation or other hierarchical elements [17]. 


Evolutionary Computation (EC), on the other hand, is both 
flexible and robust enough to induce complex graph struc- 
tures and to deal with rich graph data. We have previously 
shown that EC can be used to automatically induce positive 
and negative graph rules for student-produced argument di- 
agrams through fitness selection [17]. The induced rules can 
be used as features to provide hints for argument writing, 
and to detect structural flaws. Prior research also indicates 
that the induced graph rules from EC outperform all but one 
of the expert hand-authored rules and they outperform all 
of the rules induced by two general purpose graph grammar 
induction algorithms, Subdue [2] and gSpan [16]. However, 
prior research has shown that, while the traditional EC al- 
gorithm does converge to an optimal fitness, the premature 
convergence can lead to it getting stuck in local maxima, 
which may lead to undiscovered graph rules [6]. 


In this work, we augmented the standard EC algorithm to 
produce more heterogeneous Augmented Graph Grammars 
that can reflect innovative structures in student-produced 
argument diagrams. To that end, we incorporated a novelty
selection mechanism into our EC system that was designed 
to enforce population diversity. The goal of this diversity 
was to explicitly retain novel introns and thus to reward 
the basic stepping stones of evolution both in the internal 
(genospace) and the external application space (phenospace), 
respectively. In this work, we experimented with two differ- 
ent novelty selection mechanisms: novel genotype selection 
and novel phenotype selection. Our research hypothesis is
that novelty selection will increase the diversity of the popu- 
lation and will produce better and more heterogeneous graph 
grammars when compared with pure fitness selection. 
2. BACKGROUND


2.1 Argument diagrams 

Argument diagrams are graphical representations for real- 
world argumentation that reify the essential components of 
arguments such as hypothesis statements, claims, and citations
as nodes and the supporting, opposing, and clarification
relationships as arcs [11]. These complex elements can
include text fields describing the node and arc types or free- 
text assertions, links to external resources and other data. 


A sample student-produced diagram is shown in Figure 1. 
The diagram includes a hypothesis node at the bottom right, 
which contains two text fields, one for a conditional or if 
field, and the other for a consequent or then field. Two
citations are connected to the hypothesis via supporting and
opposing arcs colored green and red, respectively. They are
also connected via a comparison arc. Each citation contains 
two fields: one for the citation information and the other 
for a summary of the work. Each arc has a single text field 
explaining what purpose the relationship serves. 


Figure 1: A student-produced Argument Diagram.


2.2 Augmented Graph Grammars 

Augmented Graph Grammars (AGGs) are a graph-based 
rule formalism that supports rich relational structures [9]. 
AGGs are an extension of traditional graph grammars, which 
are composed of standard graph elements including ground 
nodes, ground arcs, and variable arcs which can match mul- 
tiple items. In addition to these basic features, AGGs also 
support: complex node and arc types that contain sub- 
elements; negated elements which select for the nonexistence 
of subgraphs; generalized node and arc types which match 
multiple items; complex element constraints which allow us 
to compare individual elements; complex graph expressions 
which allow for universal and existential quantification; and 
the incorporation of NLP rules or other external constraints. 
As such they are an ideal rule representation for the analysis 
of argument diagrams. 


In prior work [10, 11], we collaborated with a group of domain
experts to define a set of 77 a-priori argument rules
encoded as grammars. These rules were designed to identify
individual features of argument diagrams or sub-graphs
that were consistent with high quality argumentation or
which represented common structural flaws. We have shown
that these hand-authored graph rules are correlated with
the student-produced argument diagram grades and essay
grades, that they are empirically valid, and that they can be
used as the basis for predictive models of student grades. A
sample hand-authored rule is shown in Figure 2. This rule is
designed to identify cases where students use a citation a to
oppose a claim or hypothesis node t via an opposing path
O, and use the other citation b to support the node t via a
supporting path S, but do not include a comparison arc c
between the two citations a and b.


Figure 2: A hand-authored Augmented Graph Grammar.


2.3 Evolutionary Computation 

Evolutionary Computation (EC) is a general machine learn- 
ing algorithm based upon Natural Selection. The algorithm 
starts with a population of candidate solutions, which may 
be generated at random or user-defined. The individual so- 
lutions are assessed by an objective measurement known as 
the fitness function. Subsequent generations are produced
by a combination of elitism, in which very fit individuals are
cloned into the next generation, and fitness-proportional
reproduction, in which individuals are copied over with direct
mutations or through crossover with other members in the
population. The EC algorithm proceeds iteratively until a
given fitness threshold is reached or until a fixed number 
of generations has passed. When compared with existing 
graph grammar induction algorithms, EC is much more flex- 
ible and robust. The behavior of the system is determined 
by the user-defined solution representation, fitness function, 
and the genetic operators including mutation and crossover. 
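The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system used in this work: the function name `evolve`, the parameters `n_elite` and `crossover_rate`, and the toy bit-string problem are all hypothetical, and the solution representation, fitness function, and genetic operators are passed in by the caller.

```python
import random


def evolve(population, fitness, mutate, crossover, generations=50,
           n_elite=2, crossover_rate=0.5, target=None, rng=None):
    """Generic EC loop: elitism plus fitness-proportional reproduction."""
    rng = rng or random.Random(0)
    population = list(population)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        if target is not None and fitness(ranked[0]) >= target:
            break  # fitness threshold reached
        # Elitism: clone the fittest individuals unchanged.
        next_gen = ranked[:n_elite]
        # Fitness-proportional selection; the epsilon keeps all weights positive.
        weights = [max(fitness(ind), 0.0) + 1e-9 for ind in ranked]
        while len(next_gen) < len(population):
            parent_a, parent_b = rng.choices(ranked, weights=weights, k=2)
            if rng.random() < crossover_rate:
                child = crossover(parent_a, parent_b, rng)
            else:
                child = mutate(parent_a, rng)
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)


# Toy problem (hypothetical): maximize the number of ones in a bit string.
def count_ones(bits):
    return sum(bits)


def flip_one(bits, rng):
    i = rng.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]


def one_point(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


rng = random.Random(1)
start = [[rng.randint(0, 1) for _ in range(8)] for _ in range(20)]
best = evolve(start, count_ones, flip_one, one_point,
              generations=30, target=8, rng=rng)
```

Because the elite individuals are always carried over, the best fitness in the population can never decrease from one generation to the next.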


In prior work, we applied EC to automatically induce a set 
of AGG rules on student-produced argument diagrams [17]. 
The induced rules support disjoint subgraphs, negation, and 
generalized elements. In that work, the solution representation
was an individual graph rule. The fitness of each graph
rule was assessed via Spearman’s rank correlation (ρ)
[3] between the frequency with which a rule matches a
diagram, and the argument grades. The mutation in the EC
algorithm was basic point mutation that can add, delete, or 
modify existing nodes and arcs. Crossover was implemented 
using matrix crossover based upon the work of Stone, Pill- 
more, & Cyre [14]. 
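As an illustration of the fitness measure, the sketch below computes Spearman's ρ as the Pearson correlation of rank vectors, with ties given their average rank. The function names and the handling of constant vectors are assumptions made for this example; the paper does not describe its implementation.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant vector gets a neutral score
    return cov / (var_x * var_y) ** 0.5


def rule_fitness(match_counts, grades):
    """Fitness of one rule: rho between per-diagram match counts and grades."""
    return spearman_rho(match_counts, grades)
```

Here `match_counts[i]` would be the number of times a rule matches diagram `i`, and `grades[i]` the grade of that diagram.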


2.4 Novelty Selection 


Absolute fitness functions of the type that we used in our
prior studies are designed to reward individual progress
toward an absolute objective in the search space without
consideration for the population as a whole. Prior studies have
shown that although the search is driven to converge to a
fitness optimum, the objective function sometimes
suffers from the pathology of local optima [6]. This is
because the objective function only rewards improvements in
performance with respect to the static objective; it does not
necessarily reward diversity in the search space that can
ultimately lead to other solutions. One approach that EC
researchers have taken to address this problem is Novelty
Selection, that is, explicitly incorporating population
diversity into the fitness metric or supporting diverse solutions
irrespective of the fitness value [1, 5]. The goal in doing
so is to encourage the development of good sub-solutions or
stepping stones that can support novel solutions and avoid
local optima.


Current novelty selection algorithms fall into one of two 
broad categories: novel genotype selection, or novel pheno- 
type selection. In EC, the genotype of a solution is the basic 
solution structure or code that defines the solution, which 
corresponds to the set of genes in a real organism. The phe- 
notype, by contrast, is the observed behavior of the solution 
when it is evaluated. In the context of our work, the geno- 
type is the AGG structure while the phenotype is the way 
in which the rule maps to the graphs in our dataset. Thus 
the genotype is fixed while the phenotype is data-driven. 


Novel genotype selection focuses on finding individuals
that have a unique structure relative to the remainder of
the population. Prior researchers have focused on applying 
user-defined metrics to calculate pairwise distances between 
members of the population [4, 1]. The metrics are neces- 
sarily representation specific. Maximally-unique individu- 
als are then selected for reproduction or cloning in order to 
maintain genetic diversity. The primary shortcoming of this 
method is that computing pairwise distance can be computa- 
tionally intractable (e.g. comparing neural networks which 
is NP-Hard) [5]. 


While novel genotype selection seeks individuals with unique 
genes, novel phenotype selection rewards individuals that be- 
have differently according to some separate evaluating met- 
ric. This is usually based upon some user-defined distance 
function based upon prior knowledge of the domain. The 
goal of the metrics is to enforce coverage of the solution space 
and, as with the genotype selection, maximally unique indi- 
viduals are selected for retention. The primary disadvantage 
of this approach is that given two individuals with compara- 
ble behavior but distinct genes we will discard one and will 
potentially lose good evolvable genes in the process [5]. 


3. METHODS 


In order to compare the performance of novelty selection 
with traditional objective fitness selection, we implemented 
two novelty selection methods in EC with one rewarding 
novel rule structures (genotype) and the other rewarding 
rules that match a unique set of graphs in our dataset (phe- 
notype). For the former metric, we select the novel rules 
according to the diversity score, which is calculated using 
a greedy graph-matching algorithm; for the latter one, the 
novel rules are rewarded based on the behavior score using 
the χ² test [3]. A large diversity or behavior score indicates
that the specified rule is substantively different from the rest 
of the population. 
3.1 Genotypic Distance - Diversity Score

We define the diversity score of an individual as its average
genotypic distance from the remainder of the population.
In order to compute this score, we developed a greedy
graph matching algorithm that computes the distance based
upon local-neighborhood similarity. The root intuition behind
this algorithm is that if two graph grammars $G_0$ and $G_1$
are isomorphic then it should be possible to automatically
align their local neighborhoods (individual nodes plus
immediate neighbors). The algorithm returns a distance score
between 0 and 1 inclusive. Here 0 means that the two grammars
are completely isomorphic and 1 indicates they are
wholly distinct from one another. The algorithm operates
as follows:


First, we count the total number of nodes $n$ in both grammars
on a per-type basis. For example, Figure 3 shows two
graph grammars $G_0$ and $G_1$. They have a total of 6 nodes
of 5 types (A, B, C, D, E) and 4 arcs of 2 types (1, 2). For
category A, $G_0$ has one A node ($A_0$), while $G_1$ has two
($A_0$ & $A_1$), so $n_a = \max(2, 1) = 2$. For the remaining types
B, C, D and E, we have $n_b = n_c = n_d = n_e = 1$, and the
total number of nodes $n$ is 6.


Figure 3: Example of two graph grammars with five
categories of nodes (A, B, C, D, E) and two categories
of arcs (1, 2).


Second, we compute the individual similarity score $S =
\{s_1, s_2, s_3, \ldots, s_i, \ldots, s_n\}$ for $i \in \{0, n\}$, where $s_i$ indicates
the similarity score for node $N_i$. For nodes of the same
type, we use greedy search to find the best match for each
node and then update the maximum similarity score of the
whole grammar. The value of $s_i$ is between -1 and 1, and is
computed by the following formula:


$$
s_i =
\begin{cases}
-1 & \text{if } N_i \text{ is in } G_0 \text{ or } G_1 \text{ but not both} \qquad (1) \\[6pt]
\dfrac{\#\text{ of shared neighbors}}{\text{total } \#\text{ of neighbors in } G_0 \text{ and } G_1} & \text{otherwise} \qquad (2)
\end{cases}
$$


where $s_i = -1$ means that node $N_i$ is in either $G_0$ or $G_1$ but
not both; $s_i = 0$ indicates that node $N_i$ is in both graphs,
but they do not share any neighbor at all; $s_i = 1$ indicates
that node $N_i$ is in both graphs and they share the same
neighbor(s) with the same arc(s). Note that if two nodes
share the same neighbor but with different arcs, we do not
count it as the same neighbor.
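The per-node score can be sketched as follows. This is an illustrative reading, not the authors' code: each node's neighborhood is represented as a set of `(neighbor_label, arc_type)` pairs, "total # of neighbors" is assumed to mean the union of the two neighborhoods, and the treatment of two isolated nodes is an assumption because the text does not cover that case. The helper `best_match` stands in for the greedy step over same-type candidates.

```python
def node_similarity(neighbors_g0, neighbors_g1):
    """Similarity of one node across two grammars.

    Each argument is a set of (neighbor_label, arc_type) pairs, or None
    when the node does not occur in that grammar.
    """
    if neighbors_g0 is None and neighbors_g1 is None:
        raise ValueError("node must occur in at least one grammar")
    if neighbors_g0 is None or neighbors_g1 is None:
        return -1.0  # node present in only one grammar
    union = neighbors_g0 | neighbors_g1
    if not union:
        return 1.0  # assumption: two isolated nodes match fully
    # A neighbor only counts as shared if the arc type matches as well.
    return len(neighbors_g0 & neighbors_g1) / len(union)


def best_match(neighbors, candidates):
    """Greedy step: best score against same-type candidates in the other grammar."""
    if not candidates:
        return -1.0
    return max(node_similarity(neighbors, c) for c in candidates)
```

With this representation, a node with the same neighbor but a different arc type on each side scores 0 rather than 1, in line with the note above.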


In the example shown in Figure 3, we have $S = \{s_1, s_2, s_b, s_c,
s_d, s_e\}$. For A nodes, if we match $A_0 \in G_0$ with $A_0 \in G_1$,
we have $s_a = \frac{1}{2}$; if we match $A_0 \in G_0$ with $A_1 \in G_1$, $s_a$
is 0. Thus, the best match for $A_0 \in G_0$ is $A_0 \in G_1$ and we
update $s_1 = \frac{1}{2}$. Now for $A_1 \in G_1$, we cannot find any
node to match with, so $s_2 = -1$ using Equation (1). For the
B nodes, B is present in both graphs, they share the same
neighbor (A) with the same arc type of (1), so $s_b = 1$.
Similarly C nodes are present in both graphs, but they do
not share any neighbors because $C \in G_1$ is isolated, so
$s_c = 0$. For D and E, we have $s_d = s_e = -1$ because
nodes D and E each appear in only one of the two graphs. Thus
we have $S = \{\frac{1}{2}, -1, 1, 0, -1, -1\}$.


Finally, we use Euclidean distance to normalize the simi- 
larity scores to a distance score within a range of [0,1] by 
Equation (3). Then the diversity score for an individual is 
the average distance score to the remaining population. 


$$
D = \sqrt{\frac{\sum_{i=1}^{n} (1 - s_i)^2}{n \cdot 2^2}} \qquad (3)
$$
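A short sketch of this normalization step follows. It assumes that Equation (3) is the Euclidean distance between the similarity vector and the all-ones vector, divided by its maximum possible value, so that the result lies in [0, 1]; the function names and the precomputed `distance_matrix` are hypothetical.

```python
import math


def grammar_distance(similarities):
    """Map similarity scores in [-1, 1] to a distance in [0, 1].

    Assumed form: sqrt(sum((1 - s_i)^2) / (n * 2^2)).
    """
    n = len(similarities)
    return math.sqrt(sum((1 - s) ** 2 for s in similarities) / (n * 2 ** 2))


def diversity_score(index, distance_matrix):
    """Average distance from one individual to the rest of the population."""
    others = [d for j, d in enumerate(distance_matrix[index]) if j != index]
    return sum(others) / len(others)
```

Under this reading, identical grammars (all scores equal to 1) yield a distance of 0, and grammars with no node in common (all scores equal to -1) yield a distance of 1, which matches the range stated in the text.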


3.2 Phenotypic Distance - Behavior Score 

The behavior score of an individual is the average phenotypic
distance between it and the remainder of the population.
We use a data-driven definition of behavior. For each
individual we define its behavior signature as a vector of
non-negative integers representing the number of distinct subgraphs
that it matches for each of the 104 graphs in our dataset.
We then calculate the pairwise distance between individuals
using the χ² test of independence [12]. χ² is a statistical test
that measures divergence from the expected distribution
assuming that one feature occurs independently of the others.
It is often applied to evaluate the independence of two
variables in mathematical statistics [7]. The null hypothesis of
this test is that two variables are wholly independent. A
p-value < 0.05 for the χ² test leads us to reject the null hypothesis
and conclude that the variables are significantly correlated.


If two frequency sets are statistically independent from one
another according to the χ² test then we assign a
phenotypic distance score of 1, indicating that the grammars are
independent. If, however, they are dependent then we assign
a score of 0, meaning that the grammars are substantively
similar given our dataset. We then calculate the average
score for each individual to indicate its relative uniqueness
within the population.
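The averaging step can be sketched as below. The χ² test itself is left to a caller-supplied predicate `are_independent`, since the text does not specify how the contingency table is built from two signatures; that predicate, the function name, and the toy signatures are all assumptions made for illustration.

```python
def behavior_scores(signatures, are_independent):
    """Average binary phenotypic distance of each individual to the others.

    signatures: one match-count vector per individual (one entry per graph).
    are_independent: callable (sig_a, sig_b) -> bool, e.g. a wrapper around
        a chi-square test of independence at the 0.05 level.
    """
    scores = []
    for i, sig_i in enumerate(signatures):
        distances = [
            1.0 if are_independent(sig_i, sig_j) else 0.0
            for j, sig_j in enumerate(signatures)
            if j != i
        ]
        scores.append(sum(distances) / len(distances))
    return scores


# Toy predicate for demonstration only: treat differing signatures as independent.
toy_scores = behavior_scores(
    [(1, 0, 2), (1, 0, 2), (0, 3, 0)],
    are_independent=lambda a, b: a != b,
)
```

Individuals with a score near 1 behave unlike most of the population and would be the ones retained by novel phenotype selection.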


3.3 Dataset


For this study we used a dataset of 104 argument diagrams 
that was originally collected at the University of Pittsburgh 
in a course on Psychological Research Methods [10, 11]. The 
subgraph shown in Figure 1 was collected as part of this 
study. Students in the course were instructed to plan their 
written arguments graphically using LASAD, an online tool 
for argument diagramming and collaboration [8], and then 
to produce written essays. The diagramming ontology con- 
tained four types of nodes: citation, claim, current study 
and hypothesis; and four types of arcs: supporting, oppos- 
ing, comparison, and unspecified. Current study nodes are 
used to represent factual information about the study such 
as the target population. Unspecified arcs represent cases 
where nodes provide clarification or concept definitions. At 
the end of the study, 104 paired diagrams and essays were 
collected. These diagrams and essays were graded by an 
experienced TA according to a parallel grading rubric. 
4. EXPERIMENTS


In this work, we evaluated the impact of novelty selection 
on graph grammar induction by comparing the two types of 
novelty selection to a traditional objective-fitness approach. 
We ran three experiments to induce three sets of graph 
grammars using the different selection functions. The three 
experiments are Baseline, Geno, and Pheno respectively: 


Baseline: we used the traditional fitness function at each
generation. The fitness function measures the correlation between
the observed graph rule frequency and diagram grades.


Geno: we replaced the fitness function with novel genotype 
selection on every tenth generation. The novel genotype 
selection rewards grammars with novel structure for further 
evolution by cloning them to the next generation. 


Pheno: we used novel phenotype selection on every tenth
generation to reward graph grammars whose behaviors differ
significantly from those of the remaining population.
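The three selection schedules can be summarized in a small helper. This is a sketch under assumptions: the function name, the returned labels, and the choice to count generations from 1 (so that novelty selection fires at generations 10, 20, and so on) are illustrative and not taken from the paper.

```python
def selection_mode(generation, experiment, novelty_interval=10):
    """Return which selection function to apply at a given generation.

    experiment: "Baseline", "Geno", or "Pheno".
    generation: 1-based generation counter (assumption).
    """
    if experiment not in ("Baseline", "Geno", "Pheno"):
        raise ValueError(f"unknown experiment: {experiment}")
    on_novelty_step = generation > 0 and generation % novelty_interval == 0
    if experiment == "Geno" and on_novelty_step:
        return "novel_genotype"   # rank by diversity score
    if experiment == "Pheno" and on_novelty_step:
        return "novel_phenotype"  # rank by behavior score
    return "fitness"              # rank by correlation with grades
```

With 500 generations per run, this schedule applies novelty selection 50 times and objective fitness in the remaining 450 generations.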


For each experiment, we conducted a series of three
evolutionary runs to explore the search space. In each run, we set
a population size of 100 individuals and ran for 500
generations. The initial populations were composed of randomly
generated grammars each of which contained between 3 and
10 elements. The nodes and arcs were all ground elements
and were selected from a predefined ontology of basic types
that matched the argument diagram ontology. The fitness
function, crossover and mutation operators were the same
as in our prior work discussed in section 2.3. On each
evolutionary run, we harvested all graph grammars generated
over the course of the run whose performance exceeded a
threshold of (ρ > 0.18) and preserved them for later
analysis. The threshold was chosen based upon a series of
exploratory studies which showed that ρ values at or above
this threshold were statistically significant.


5. RESULTS & ANALYSIS 


After collecting the three sets of grammars, we applied the
graph matching algorithm discussed in section 3.1 to identify
the isomorphic rules. We then filtered the overlapping rules
to obtain the unique rule sets. Table 1 shows the number of
unique rules collected from each experiment along with the ρ
values for the top three rules in each unique rule set. The top
three rows display the rules that are unique to the Baseline
and Geno experiments along with the overlapping rules
shared between them (Baseline ∩ Geno). The bottom three
rows show the rules that are unique to the Baseline and
Pheno experiments, and the overlapping rules between them
(Baseline ∩ Pheno).


Table 1: The number of unique rules above the
threshold (ρ > 0.18) and the Spearman’s Correlation
value ρ for the top three best rules

Experiments         Unique rules   ρ (1st)   ρ (2nd)   ρ (3rd)
Baseline-Only             37        0.282     0.279     0.260
Geno-Only                112        0.348     0.334     0.325
Baseline ∩ Geno          146        0.371     0.369     0.362
Baseline-Only             26        0.282     0.260     0.254
Pheno-Only                99        0.348     0.334     0.333
Baseline ∩ Pheno         157        0.371     0.369     0.362


Figure 4: Best performing graph rule in Geno Only
and Pheno Only with correlation (ρ = 0.348).


Figure 5: Best performing rule in EC experiment
with the correlation (ρ = 0.371).


As Table 1 indicates, after removing the isomorphic rules,
the Geno and Pheno experiments still produced a large number
of high-performing rules with Geno-Only having 112
unique rules and Pheno-Only having 99. The top three
performing rules in Geno- and Pheno-Only outperform the rules
in Baseline-Only. After examining these rules, we
found that the top two rules in Geno- and Pheno-Only are
isomorphic with the same performance; the best rule is
shown in Figure 4. This rule contains 6 nodes with two
citations (c0 & c1) supporting two claims (k0 & k1) and two
isolated nodes, one hypothesis (h) and one citation (c2),
which may or may not be connected to the remaining
structure. This reflects an argument diagram where the students
have two solid claims supported by different citations and
where they include both a hypothesis and at least one
additional supporting citation. This rule captures another
highly correlated feature in the student-produced argument
diagrams: two claims are supported by two different
citations.


The top three rules in Baseline ∩ Geno and Baseline ∩ Pheno
outperform the rules in both Baseline-Only and the rules in
Geno- and Pheno-Only. We also found that these three best
rules are isomorphic with the same performance, meaning
that all three fitness models are capable of identifying the
best performing rules on our dataset. Figure 5 shows the
best graph rule with the correlation (ρ = 0.371). It
represents a rule with 5 nodes, two of which are citations (c0 &
c1) that support a shared claim node (k0). The remaining
nodes consist of a single claim (k1) and hypothesis (h) which
may or may not be connected to the other elements. This
reflects a graph where the authors identified at least two
related citations that can be synthesized to support a single
claim and where they included both a hypothesis and
another claim. This is one of the structures that students have
been encouraged to make in their arguments as it shows an
ability to synthesize cited work to form a complex claim.


Figure 6: Example graph rules with unique structures.
B-U-G: unique rule in Baseline with correlation
(ρ = 0.280); G-U-G: unique rule in Geno
experiment with correlation (ρ = 0.197); P-U-G:
unique rule in Pheno experiment with correlation
(ρ = 0.182).


We also investigated the unique structures that were specific 
to each experiment. The structure refers to the sub-graph 
within a graph rule but without isolated node(s). When 
comparing the Baseline and Geno experiments, we found 
three unique structures that only show up in the Baseline 
experiment and six in Geno. When comparing the Baseline 
and Pheno experiments, we identified three unique struc- 
tures in the Baseline experiment and four in the Pheno ex- 
periment respectively. 


Figure 6 shows three example graph rules with unique
structures in each experiment. B-U-G is a unique rule induced
in the Baseline experiment. It matches cases where two
citations (c0 & c1) support two claims (k0 & k1) and are
connected via a supporting arc (s2) and where an isolated
hypothesis (h) may or may not be connected to the
remaining structure. This rule reflects a very interesting argument
structure where the student used one citation to directly
support a claim and the other citation to support this claim
with another intermediate claim. G-U-G shows a rule that
was induced in the Geno experiment. It has one citation
(c) that supports a claim (k0) which in turn supports a
hypothesis (h). This citation is also connected to a claim (k1)
with an unspecified arc (u). It also has an isolated claim
(k3) which may or may not be connected to the remainder
of the structure. This rule indicates another innovative use
of chaining support which students were encouraged to use
and which is comparable to B-U-G.


P-U-G shows a graph rule from the Pheno experiment. It
contains a connected structure with four arcs, and is the
most complex rule above the threshold. This connected
structure has two citations with one supporting another (c0
& c1) and then jointly supporting a shared claim (k) which
in turn directly supports a hypothesis (h). The rule also
contains an isolated citation (c3) which may or may not
connect to the remaining structure. Conceptually this indicates
a case where a grounded claim supports a research
hypothesis. In the real world, it indicates that the author sought out
closely-related sources of literature or noted important
connections between them, then used this well-supported claim
to support a research hypothesis, something which they had
been encouraged to do in class.


6. CONCLUSION AND FUTURE WORK 


In this work, we augmented the standard EC with two novelty
selection methods to induce Augmented Graph Grammars
on student-produced argument diagrams by replacing
the fitness function with a novelty selection function every
ten generations. This novelty selection promotes diversity
in the population by explicitly encouraging the production 
and maintenance of novel stepping stones or partial solutions 
in the genotypic and phenotypic spaces. Our experimen- 
tal results indicate that, when compared to pure objective- 
fitness selection, the novelty-selection functions produced 
more heterogeneous and better-performing graph grammars. 
The unique rules that were induced by each experiment re- 
flect some novel features in student-produced argument di- 
agrams. The significance of this work is that the novelty 
selection can enhance EC to produce more empirically-valid 
rules that can be used for automatic grading. 


In future work, we plan to work with domain experts to de- 
termine whether the rules are semantically valid, and whether 
or not they can serve as the basis for automatic hinting. We 
will also build an intelligent argument grading system to au- 
tomatically grade and provide feedback on student-produced 
argument diagrams based on the induced graph grammars 
and other argument diagram features. 
ABSTRACT


This paper discusses a novel approach for developing more
refined and accurate learner models from student data
collected from Open Ended Learning Environments (OELEs).
OELEs provide students choice in how they go about
constructing solutions to problems, and students exhibit a
variety of learning behaviors in such environments. Building
accurate models from a limited amount of student data is
difficult; to address this we develop a methodology that uses
Monte Carlo Tree Search methods to boost the initial set of
student action sequences in such a way that we can learn
more accurate models of students’ learning behaviors. We
use an HMM representation to model students’ learning
behaviors and demonstrate the effectiveness of our approach by
running a case study on data collected from 98 students, who
worked with the Betty’s Brain system for four days. The
results have interesting implications for learner modeling and
its applications to adaptive scaffolding of students’ learning
behaviors and strategies as they learn from OELEs.


1. INTRODUCTION 


In recent work on computer-based STEM learning environ- 
ments, there has been a focus on developing OELEs, which 
provide students with a learning goal, usually in the form of 
a complex problem or a modeling task, and a set of tools that 
support the problem-solving/modeling task [1]. To succeed,
these students need to make choices on how to structure the 
solution process, explore alternative solution paths, develop 
awareness of their own knowledge and problem-solving skills, 
and develop strategies that support more effective learning 
and problem solving [2]. 


Given the complexities students face in working with OELEs,
it is imperative that effective scaffolding be provided
to help them progress in their learning and problem solving
tasks and achieve their learning goals. However, an
important component of effective scaffolding is learner modeling
that can accurately capture students’ cognitive and
metacognitive processes. In this work, we take on the challenge
of using data-driven techniques to construct accurate
models of learner behaviors and performance by analyzing the
learners’ activity data from OELEs.


Typically, data-driven methods require large volumes of rich
data to support accurate and robust learner modeling.
However, collecting such data from OELEs, especially in K-12
settings, can be a difficult, time-consuming process. To
alleviate this problem, we propose a novel set of techniques that
combine the use of Hidden Markov Modeling (HMM) [7],
Monte Carlo Tree Search (MCTS) [3], and a reinforcement
learning methodology [4] to generate artificial student
activity data that simulates students’ behavior corresponding to
learning activities captured in the log data. The original
student data combined with the artificially generated data
is then used to derive more accurate and complete models
of students’ behaviors and strategies used for learning.


In section 2, we briefly review the Betty’s Brain OELE that 
we use for this work, and describe the overall learner mo- 
deling approach as well as the two more important techni- 
ques that we employ, i.e., HMMs and MCTS. Section 3 pro- 
vides experimental results and evaluations of our learner mo- 
deling method by comparing analysis results of original data 
with data generated post-reinforcement learning. Section 4 
presents the discussion and conclusions. 


2. BACKGROUND 

We implement the learner modeling methods starting from 
data collected from student work in the Betty’s Brain OELE. 
Betty’s Brain is a learning by teaching environment, where 
students utilize tools for information acquisition, solution 
construction and solution assessment to teach a virtual cha- 
racter named Betty by constructing a causal map [5]. The 
primary student actions in the Betty’s Brain environment 
can be categorized as: 


Information Acquisition (IA): It relates to actions such
as reading to learn new information (read) and searching for
specific knowledge (search). Taking and viewing notes is also
considered to be useful for information acquisition (notes).


Solution Construction (SC): In Betty’s Brain, SC actions
are causal map editing actions (mapedit), which include
addition and deletion of concepts and adding, deleting
or changing links in the causal map.


Solution Assessment (SA): It consists of asking Betty to
take a quiz (quiz), to answer questions (query), and to explain
how she derived her answers using qualitative reasoning
methods (expl). In addition, students can mark the correctness of links
that have been added to assist their solution assessment.


Students’ performance is based on a map score that is com- 
puted by comparing their causal models with a pre-specified 
expert model. In our study, the expert model had 15 links, 
which implies that the students could achieve a max map 
score of 15. At any time, the students’ map score is computed
as the number of correct links minus the number of incorrect
links in their constructed (partial) maps. Next, we describe
the learner modeling approach applied to Betty’s Brain. 
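The map score defined above reduces to a set comparison. The sketch below assumes that each causal link is represented as a `(source, relation, target)` tuple; that representation and the example concept names are hypothetical and are not taken from Betty's Brain itself.

```python
def map_score(student_links, expert_links):
    """Number of correct links minus number of incorrect links."""
    student, expert = set(student_links), set(expert_links)
    correct = len(student & expert)
    incorrect = len(student - expert)
    return correct - incorrect


# Hypothetical two-link expert model and a partial student map.
expert = {("a", "+", "b"), ("b", "-", "c")}
student = {("a", "+", "b"), ("a", "-", "c")}
score = map_score(student, expert)  # one correct link, one incorrect link
```

A student who reproduces the expert model exactly reaches the maximum score, which equals the number of links in the expert model (15 in this study).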


2.1 General Approach 

Figure 1 illustrates the general approach that we have
developed for our learner modeling method. As a first step,
we apply an HMM clustering method [6] that divides the
students’ behaviors into groups of similar behaviors. We then
iteratively generate a more accurate HMM model for each
group by running an MCTS algorithm that, combined with
a reinforcement learning approach, produces a number of
additional student behavior sequences that provide more
coverage of the students’ learning behaviors. These
additional sequences, when combined with the original student data,
are used to learn a new HMM model that we believe is a more
complete description of the students’ learning behaviors.

Figure 1: Architecture of the Overall Approach


2.2 HMM applied to Learner Modeling 

An HMM is defined as a tuple, i.e., $\lambda = \{A, B, \pi\}$, where
$A$ and $B$ represent state transition probability distribution
and emission probability distribution matrices, respectively,
while $\pi$ is the initial state probability distribution [7].
Figure 2 presents the state diagram of a simple HMM example
trained on two action sequences $S_1$ and $S_2$ with only 4
action types. Although not explicitly shown in the action
sequences, the hidden states $h_1$ and $h_2$ can be interpreted as
IA state (searching for and reading resources) and SC state
(editing concept entities and causal links) respectively.


Based on the different probability distribution for each
observation (action), the hidden states can be labeled by the
primary actions associated with that state. The transitions
between states capture changes in student activities over
time, as well as frequent patterns of activities, e.g., frequent
occurrence of information acquisition followed by solution
construction patterns.
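To make the tuple concrete, the sketch below stores the three components as dictionaries and samples an action sequence from them. The class and function names are hypothetical, and the two-state model at the end uses made-up deterministic probabilities chosen only so that the output is easy to follow; it is not the model learned in this work.

```python
import random
from dataclasses import dataclass


@dataclass
class HMM:
    A: dict   # A[state][next_state] = transition probability
    B: dict   # B[state][action] = emission probability
    pi: dict  # pi[state] = initial state probability


def _draw(distribution, rng):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(distribution)
    weights = [distribution[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def sample_actions(hmm, length, rng=None):
    """Generate an action sequence by walking the hidden states."""
    rng = rng or random.Random(0)
    state = _draw(hmm.pi, rng)
    actions = []
    for _ in range(length):
        actions.append(_draw(hmm.B[state], rng))  # emit an action
        state = _draw(hmm.A[state], rng)          # move to the next state
    return actions


# Illustrative model: an IA state that always reads, an SC state that always edits.
toy = HMM(
    A={"IA": {"SC": 1.0}, "SC": {"IA": 1.0}},
    B={"IA": {"read": 1.0}, "SC": {"mapedit": 1.0}},
    pi={"IA": 1.0},
)
sequence = sample_actions(toy, 4)
```

In this toy model the alternation between the two hidden states shows up directly as an information acquisition action followed by a solution construction action.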


Figure 2: Simple HMM example.


2.3 Reinforcement Learning using MCTS


To learn accurate and robust HMMs, it is important that the
data set cover the range of behaviors a student exhibits in
sufficiently large numbers. However, given that we have
limited student activity data on the system, we suffer from the
data impoverishment problem. To address this problem, we
propose a novel reinforcement learning method using Monte
Carlo Tree Search (MCTS) and combine it with an initially
derived HMM model to generate artificial data that matches
students’ learning behaviors. For generating action
sequences that simulate actual students’ behavior, we build the
MCTS tree and traverse it to iteratively pick the next best
node (with the highest number of simulations) as the new action
and add it to the tail of the sequence. In the reinforcement
learning process as illustrated in Figure 1, we repeatedly
generate simulated action sequences that maximize a specified
reward function, and add them to the previously generated
data. The reinforced data set is used to construct a refined
version of the HMMs.


MCTS performs an iterative search in which each iteration
consists of 4 steps, i.e., Selection, Expansion, Simulation
and Backpropagation [3]. In most MCTS implementations,
the Upper Confidence bounds applied to Trees (UCT)
algorithm is applied as the reward function for Selection:


$$
UCT = \frac{w_i}{n_i} + c \sqrt{\frac{\ln t}{n_i}} \qquad (1)
$$


where $n_i$ is the number of simulations performed after adding
the $i$th action; $c$ is the exploration parameter, with a
typically chosen empirical value of $\sqrt{2}$; $t$ is the total
number of simulation runs for the parent node, which is equal
to the sum of all the $n_i$; and $w_i$ is the sum of wins (1’s) for
all simulations after adding the $i$th action.
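
As a minimal sketch of the Selection score defined above, the following Python snippet computes the UCT value for one child node and picks the child with the highest value. The function names, the tuple layout, and the convention of visiting unvisited children first are illustrative assumptions, not details taken from the system described here.

```python
import math

def uct_score(wins, visits, parent_visits, c=math.sqrt(2)):
    """UCT value w_i / n_i + c * sqrt(ln(t) / n_i) for a single child node."""
    if visits == 0:
        # Assumption: unvisited children get top priority (a common convention).
        return math.inf
    exploitation = wins / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration

def select_child(children):
    """children: list of (action, wins, visits); returns the action with highest UCT."""
    total = sum(visits for _, _, visits in children)  # t = sum of all n_i
    return max(children, key=lambda ch: uct_score(ch[1], ch[2], total))[0]
```

With the default $c = \sqrt{2}$, a rarely visited action can outrank a frequently visited one with a higher win rate, which is the exploration behavior the second term is meant to provide.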


We adopt a similar reward function and compute the $w_i$ value
for generating action sequences that form a Reinforced
scaffolding model. In this model, the simulation results,
normalized to the range from lowest to highest performance
measure, are summed up to compute $w_i$. For example, an action
sequence has $w_i = 1$ when it achieves the maximum map score
(i.e., 15) in Betty’s Brain. This allows MCTS to better
utilize coherence relations [8] to generate action sequences
with more effective SC actions. The resulting HMM will
favor the use of more coherent actions and be able to capture
the evolution of learning behaviors and strategies that lead to
better learning performance. Such behavioral and strategic
evolution can provide the basis for adaptive scaffolding.
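
The normalized reward can be sketched as follows. This is only an illustration under stated assumptions: the maximum map score of 15 comes from the text, while the lower bound of 0, the clipping, and the function names are hypothetical choices made for the example.

```python
def normalized_reward(score, lowest, highest):
    """Map a performance measure onto [0, 1]; the highest score maps to 1.0."""
    if highest <= lowest:
        raise ValueError("highest must exceed lowest")
    clipped = min(max(score, lowest), highest)
    return (clipped - lowest) / (highest - lowest)

def accumulate_wins(sim_scores, lowest=0.0, highest=15.0):
    """w_i as the sum of normalized simulation results below one node.

    Assumption: lowest=0.0 is a placeholder; highest=15 is the maximum
    map score mentioned in the text.
    """
    return sum(normalized_reward(s, lowest, highest) for s in sim_scores)
```

Because each simulation contributes a value in [0, 1] rather than a binary win, $w_i / n_i$ becomes the average normalized map score of the simulations that pass through a node.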


We use the HMM to constrain the Expansion and Simulation
steps to prevent expanding unvisited nodes and associated
actions that are not likely to occur in a given
state. With these simulation and expansion policies, we can
always generate action sequences that fit the HMM within
a specified variance range. Figure 4 shows a simple example
of generating an artificial action sequence by applying MCTS.


Figure 3: HMMs for the three clusters


Table 1: Comparison of the Three Clusters (columns: IA state, SA state, Balanced)


Figure 4: Simple example of applying MCTS for
generating an action sequence. $n_i$ is the number of
simulations performed during MCTS.


3. EXPERIMENTS AND ANALYSIS 

We use data from a Betty’s Brain study run with 98 sixth-grade
middle school students in a science classroom for our
experiments. An HMM clustering algorithm [6] is applied to
discover groups of action sequences with high within-cluster
homogeneity. This algorithm produced 3 clusters with the
highest Partition Mutual Information value. HMMs for the
three clusters are represented by the state diagrams shown
in Figure 3, where $h_i$ represents the $i$th hidden state with
corresponding initial probability $\pi_i$. State transition
probabilities are marked on the transition links, while the
emission probability of an action $a$ in a state diagram is given
by $p(a)$. To measure students’ performance in the different
clusters, we denote the average pre- to post-test score gain
as $S_g$, and the average final causal map score of the
group as $S_m$. We combine this information to interpret and
compare students’ behaviors in the three different groups, as
shown in Table 1.


As we can see from Table 1, all three clusters have an SA
state (primarily focusing on SA actions). However, Cluster
3 does not have an IA state, while Cluster 2 does not have
states that balance efforts between IA & SC, and SC & SA.
These balanced efforts aim to use acquired information
or solution assessment results to support subsequent SC
actions. In addition, only Cluster 1 maintains a good
proportion of Search & Note actions, which are considered to be
more active ways of acquiring information. Students in
Clusters 1 and 3 did better in strategic state transitions, while
for Cluster 2, self-transitions dominated in all states. The
performance measures of students in Cluster 1, i.e., $S_g$ and
$S_m$, are the best among all three clusters.


3.1 Reinforced Scaffolding Model Analysis 


The reinforced scaffolding model described in Section 2.3
aims to capture useful behavioral and strategic evolution.
To validate it, we analyze the generated reinforced
HMMs along with artificial action sequences whose number
equals the sample size of the original data set. The reinforced
HMMs are shown in Figure 5.


Figure 5: Reinforced Scaffolding HMMs for the three clusters


Compared to the original HMMs (Figure 3), the HMMs for
the three clusters gradually converge to an isomorphic 3-state
HMM structure. The differences between the original and
refined HMMs can be summarized as follows: (1) the HMMs tend to
redistribute the efforts made between IA & SC, as well as
SC & SA, e.g., the proportion of IA in h, is decreased for
cluster 1 but it is increased for the other two clusters. Given
the probability of IA supporting SC, $P_{IA\text{-}SC} = 0.43 \approx 3:7$
according to statistics, the reinforced HMMs tend to have all
SC actions supported by at least one IA action by
converging the emission probabilities of IA and SC towards a ratio
of 70% : 30%. This is because SC actions that are
supported by IA actions have a higher probability of being effective
(the ratio for unsupported:supported mapedits to be
correct is 0.41 : 0.53); and (2) the usage frequency of actions
such as search increases significantly, especially for clusters
2 and 3. An explanation for this phenomenon is that in the
few cases that search appeared in the original data set, it 
is very likely followed by a read that supports a subsequent 
mapedit. The original HMM captures this pattern by having 
a hidden state hs with relatively high emission probability 
for search, read and mapedit. When it expands to a node 
with search action during MCTS, the posterior probability 
for the hidden state to remain in h, is high and, therefore, 
further expansion can form this specific pattern and result 
in a higher chance of correct mapedit. Since the reward 
function is designed to optimize the causal map score, the 
reinforcement learning is likely to follow this pattern more 
frequently when generating artificial action sequences. 


4. DISCUSSION AND CONCLUSIONS 


In this paper, we proposed a novel reinforcement learning
method for learner modeling, which integrates Hidden Markov
Models and Monte Carlo Tree Search within a reinforcement
learning framework to generate more accurate learner
models for groups of students. We applied the HMM clustering
algorithm to divide students into groups based on their
behaviors. Analysis and interpretation of these groups are
presented to explain the clustering results.


We then used student activity data collected from a
study with the Betty’s Brain OELE and generated reinforced
data sets along with the Reinforced scaffolding model.
The experiments showed promising results according to our
interpretation: we were able to generate and interpret
reinforced HMMs by analyzing the evolution of learning
behaviors that can lead to better performance in building
causal maps.


In future work, we will develop scaffolding methods to support
students in learning new, more productive behaviors and
strategies as they work on the system. It will also be of
interest to study how our reinforcement learning method
works in longitudinal studies of students, collecting data
across longer periods of time to generate dynamic coherence
models. In addition, we will collect data from other learning
environments, or even from other domains, to see how
well our modeling methods perform.


ABSTRACT


Personalized learning considers that the causal effects of a
studied learning intervention may differ for the individual
student. Making inferences about the causal effects of studied
interventions is a central problem. In this paper we propose
the Residual Counterfactual Networks (RCN) for answering
counterfactual inference questions, such as “Would this
particular student benefit more from the video hint or the
text hint when the student cannot solve a problem?”. The
model learns a balancing representation of students by
minimizing the distance between the distributions of the
control and the treated populations, and then uses a residual
block to estimate the individual treatment effect based on
the representation of the student. We run experiments on
semi-simulated datasets and real-world educational online
experiment datasets to evaluate the efficacy of our model.
The results show that our model matches or outperforms
the state-of-the-art.


Keywords 
Counterfactual inference, deep residual learning, educational 
experiments, individual treatment effect 


1. INTRODUCTION 


The goal of personalized learning is to provide pedagogy,
curriculum, and learning environments to meet the needs
of individual students. For example, an Intelligent Tutoring
System (ITS) decides which hints would most benefit a
specific student. If the ITS could infer what the student’s
performance would be after receiving each hint, then it would
simply choose the hint which leads to the best performance
for the student. To make this possible, we might run an
online educational experiment by randomly assigning
students to one of the hints, and collect student performance.
Making predictions about the causal effects of possible
interventions (e.g. available hints) then becomes a central problem.
In this paper we focus on the task of answering
counterfactual questions [8] such as, “Would this particular
student benefit more from the video hint or the text hint
when the student cannot solve a problem?”


There are two ways of collecting data for counterfactual
inference: randomized controlled trials (RCTs) and observational
studies. In RCTs, participants (e.g. students) are randomly
assigned to interventions (e.g. video hints or text hints),
while participants in observational studies are not necessarily
randomly assigned to interventions. For example, consider
an experiment evaluating the efficacy of video hints and
text hints for a certain problem. Under an RCT design,
students who need a hint would be randomly assigned to
either the video hints or the text hints. In an observational
study, students are assigned to one of the interventions based
on their contextual information, such as knowledge level or
personal preference.


[5] proposed Balancing Neural Networks (BNN), which can
be applied to solve the counterfactual inference problem.
They used a form of regularizer to enforce similarity
between the distributions of representations learned for
populations with different interventions, for example, the
representations for students who received text hints versus those
who received video hints. This reduces the variance from
fitting a model on one distribution and applying it to another.
Because of random assignment to the interventions in RCTs,
the distributions of the populations within different
interventions are highly likely to be identical. However, in an
observational study, we may end up with a situation where
only male students receive video hints and female students
receive text hints. Without enforcing similarity between
the distributions of representations for male and female
students, it is not safe to make a prediction of the outcome if
male students receive text hints. In machine learning,
“domain adaptation” [7] refers to the setting in which the
distribution of the training data differs from that of the test data.


Recent work [6] has demonstrated that (deep) neural
networks can be used with domain adaptation approaches to
produce outstanding results on some domain adaptation
benchmark datasets. Motivated by their work, we propose the
Residual Counterfactual Networks (RCN) for counterfactual
inference to estimate the individual treatment effect,
and evaluate its efficacy on both a simulated dataset and a
real-world dataset from an educational online experiment.
The RCN extends the BNN by adding a residual block to
estimate the individual treatment effect (ITE) based on the
learned representation of participants. The idea of the
residual block originates from state-of-the-art deep residual
learning [2]. We enable the estimation of the ITE by plugging
several layers into the neural network to explicitly learn the
residual function with reference to the learned representation.


The rest of the paper is organized as follows. Section 2 pro- 
vides an overview of the problem setup of counterfactual 
inference for estimating the ITE. Section 3 describes our
model in detail. Section 4 gives an overview of related
work in this research area. Section 5 describes the datasets 
and evaluation metrics used to test our model. Section 6 
presents the results of our model and compares them with 
other models. Finally, we discuss the results and conclude 
the paper. 


2. PROBLEM SETUP 


Let $\mathcal{T}$ be the set of proposed interventions we wish to
consider, $\mathcal{X}$ the set of participants, and $\mathcal{Y}$ the set of
possible outcomes. For each proposed intervention $t \in \mathcal{T}$,
let $Y_t(x) \in \mathcal{Y}$ be the potential outcome for $x$ when $x$ is
assigned to the intervention $t$. In randomized controlled trials
(RCTs) and observational studies, only one outcome is observed
for a given participant $x$; even if the participant is given one
intervention and later the other, the participant is not in the
same state. In machine learning, “bandit feedback” refers to this
kind of partial feedback. The model described above is also known
as the Rubin-Neyman causal model [11, 10].


We focus on a binary intervention set $\mathcal{T} = \{0, 1\}$, where
intervention 1 is often referred to as the “treated” and
intervention 0 as the “control.” In this scenario the ITE for a
participant $x$ is represented by the quantity $Y_1(x) - Y_0(x)$.
Knowing this quantity helps assign participant $x$ to the better
of the two interventions when a decision is needed,
for example, choosing the best intervention for a specific
student when the student has trouble solving a problem.
However, we cannot directly calculate the ITE because
we can only observe the outcome of one of the two
interventions.


In this work we follow the common simplifying assumption 
of no hidden confounding variables. This means that all the
factors determining the outcome of each intervention are 
observed. This assumption can be formalized as the strong 
ignorability condition: 


$$(Y_1, Y_0) \perp t \mid x, \qquad 0 < p(t = 1 \mid x) < 1 \quad \forall x$$


Note that we cannot evaluate the validity of strong ignor- 
ability from data, and the validity must be determined by 
domain knowledge. 


In the “treated” and “control” setting, we refer to the
observed and unobserved outcomes as the factual outcome
$y^F(x)$ and the counterfactual outcome $y^{CF}(x)$, respectively.
In other words, when the participant $x$ is assigned to the
“treated” ($t = 1$), $y^F(x)$ is equal to $Y_1(x)$, and $y^{CF}(x)$ is
equal to $Y_0(x)$. Conversely, when $x$ is assigned to the
“control” ($t = 0$), $y^F(x)$ is equal to $Y_0(x)$, and $y^{CF}(x)$ is
equal to $Y_1(x)$.

Given $n$ samples $\{(x_i, t_i, y_i^F)\}_{i=1}^n$, where
$y_i^F = t_i \cdot Y_1(x_i) + (1 - t_i) Y_0(x_i)$, a common approach for
estimating the ITE is to learn a function
$f : \mathcal{X} \times \mathcal{T} \to \mathcal{Y}$ such that $f(x_i, t_i) \approx y_i^F$.
The estimated ITE is then:

$$\widehat{ITE}(x_i) = \begin{cases} y_i^F - f(x_i, 1 - t_i), & t_i = 1, \\ f(x_i, 1 - t_i) - y_i^F, & t_i = 0. \end{cases}$$
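
To make the case-wise estimate concrete, here is a minimal Python sketch that combines one factual observation with a model's counterfactual prediction: for a treated participant the estimate is the factual outcome minus the predicted control outcome, and for a control participant it is the predicted treated outcome minus the factual outcome. The model `toy_model` is a hypothetical stand-in for a learned $f(x, t)$ and is not part of the original work.

```python
def estimate_ite(f, x, t, y_factual):
    """Estimate the ITE from the factual outcome and f's counterfactual prediction."""
    y_counterfactual = f(x, 1 - t)  # prediction under the intervention not received
    if t == 1:
        return y_factual - y_counterfactual
    return y_counterfactual - y_factual

def toy_model(x, t):
    """Hypothetical outcome model used only for illustration."""
    return 0.5 * x + 0.2 * t
```

In both cases the estimate has the form "treated outcome minus control outcome"; only the source of each term (observed or predicted) changes with the assignment.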


We assume the $n$ samples $\{(x_i, t_i, y_i^F)\}_{i=1}^n$ form an empirical
distribution $\hat{p}^F = \{(x_i, t_i)\}_{i=1}^n$. We call this empirical
distribution $\hat{p}^F \sim p^F$ the empirical factual distribution. In
order to calculate the ITE, we need to infer the counterfactual
outcome, which depends on the empirical distribution
$\hat{p}^{CF} = \{(x_i, 1 - t_i)\}_{i=1}^n$. We call this empirical distribution
$\hat{p}^{CF} \sim p^{CF}$ the empirical counterfactual distribution.
$p^F$ and $p^{CF}$ may not be equal because
the distributions of the control and the treated populations
may be different. This inequality means that counterfactual
inference may be performed over a different distribution
than the one observed in the experiment. In machine
learning terms, this scenario is usually referred to as
domain adaptation, where the distribution of features in the test
data differs from the distribution of features in the training
data.


3. MODEL 


We propose RCN to estimate the individual treatment effect
using counterfactual inference. The RCN first learns a
balancing representation of deep features $\Phi : \mathcal{X} \to \mathbb{R}^d$, and
then learns a residual mapping $\Delta f$ on the representation to
estimate the ITE. The structure of the RCN is shown on the
left side of Figure 1.


To learn a representation of deep features $\Phi$, the RCN uses
fully connected layers with the ReLU activation function, where
$\mathrm{ReLU}(z) = \max(0, z)$. We need to generalize from the factual
distribution to the counterfactual distribution in the feature
representation $\Phi$ to obtain an accurate estimate of the
counterfactual outcome. Common successful approaches to
domain adaptation encourage similarity between the latent
feature representations with respect to the different distributions.
This similarity is often enforced by minimizing a certain distance
between the domain-specific hidden features. The distance
between two distributions is usually referred to as the
discrepancy distance, introduced by [7], which is a hypothesis
class dependent distance measure tailored for domain
adaptation.


In this paper we use an Integral Probability Metric (IPM)
measure of distance between two distributions $p_0 = p(x \mid t = 0)$
and $p_1 = p(x \mid t = 1)$, also known as the control and
treated distributions. The IPM for $p_0$ and $p_1$ is defined as

$$\mathrm{IPM}_{\mathcal{F}}(p_0, p_1) := \sup_{f \in \mathcal{F}} \left| \int_S f \, dp_0 - \int_S f \, dp_1 \right|,$$

where $\mathcal{F}$ is a class of real-valued bounded measurable
functions on $S$.

Figure 1: (left) Residual Counterfactual Networks for counterfactual inference. IPM is adopted on layers fc1
and fc2 to minimize the discrepancy distance of the deep features of the control and the treated populations.
For the treated group, we add a residual block fcr1-fcr2 so that $f_T(x) = f_C(x) + \Delta f(x)$; (right) Residual block,
labeled $F(x) = x + \Delta F(x)$.


The choice of functions is the crucial distinction between
IPMs [15]. Two specific IPMs are used in our experiments:
the Maximum Mean Discrepancy (MMD) and the Wasserstein
distance. When $\mathcal{F} = \{f : \|f\|_{\mathcal{H}} \le 1\}$, where $\mathcal{H}$
represents a reproducing kernel Hilbert space (RKHS) with $k$
as its reproducing kernel, $\mathrm{IPM}_{\mathcal{F}}$ is called MMD. In other
words, the family of norm-1 RKHS functions leads to the MMD.
The family of 1-Lipschitz functions $\mathcal{F} = \{f : \|f\|_L \le 1\}$,
where $\|f\|_L$ is the Lipschitz semi-norm of a bounded continuous
real-valued function $f$, makes the IPM the Wasserstein distance.
Both the Wasserstein and MMD metrics have consistent estimators
which can be efficiently computed in the finite sample case [14].
The important property of an IPM is that $p_0 = p_1$ iff
$\mathrm{IPM}_{\mathcal{F}}(p_0, p_1) = 0$.
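
As an illustration of a finite-sample IPM estimate, the sketch below computes a biased (V-statistic) estimate of the squared MMD between two samples of feature vectors. The Gaussian (RBF) kernel and the bandwidth `sigma` are assumptions made for this example; the text does not state which kernel is used.

```python
import math

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel between two feature vectors (assumed kernel choice)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_squared(control, treated, sigma=1.0):
    """Biased estimate of squared MMD between control and treated samples."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))
    return (mean_kernel(control, control)
            + mean_kernel(treated, treated)
            - 2.0 * mean_kernel(control, treated))
```

The estimate is zero when the two samples coincide and grows as the control and treated representations move apart, which is the behavior a balancing penalty relies on.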


The representation with reduction of the discrepancy be- 
tween the control and the treated populations helps the 
model to focus on balancing features across two populations 
when inferring the counterfactual outcomes. For instance, 
if in an experiment, almost no male student ever received 
intervention A, inferring how male students would react to 
intervention A is highly prone to error and a more conser- 
vative use of the gender feature might be warranted. 


After balancing the feature representations of the control 
and the treated populations, the next step is to infer the 
treatment effect for participant x. We adopt the residual 
block [2] to estimate the treatment effect. 


As shown on the right side of Figure 1, $F(x)$ is the underlying
desired function mapping. Instead of stacking a number
of layers to fit the desired $F(x)$, we let stacked fully
connected layers learn the residual mapping $\Delta f(x) = F(x) - x$.
The original mapping is then recast as $\Delta f(x) + x$. The
operation $\Delta f(x) + x$ is performed by a shortcut connection
and an element-wise addition. Learning the residual mapping
is favored over fitting the desired mapping directly, because
it is easier to find the residual with reference to an identity
mapping than to learn the mapping from scratch.
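
The shortcut connection and element-wise addition can be sketched in a few lines of plain Python. This is a toy illustration with hand-written dense layers; the two-layer depth, the ReLU between the layers, and all function names are assumptions for the example rather than the configuration used in the experiments.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def linear(weights, bias, v):
    """Fully connected layer: one output per weight row."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]

def residual_block(x, w1, b1, w2, b2):
    """F(x) = x + delta_f(x), where delta_f is two stacked fully connected layers."""
    delta = linear(w2, b2, relu(linear(w1, b1, x)))
    # Shortcut connection: add the block input to the learned residual.
    return [xi + di for xi, di in zip(x, delta)]
```

When all weights and biases of the stacked layers are zero the block reduces to the identity mapping, which is why learning a small residual is easier than learning the full mapping.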


The goal of the residual block is to approximate a residual
function $\Delta f$ such that $f_T(x) = f_C(x) + \Delta f(f_C(x))$, where
$f_C$ is the deep representation of participant $x$ before being
fed into the output layer, and $f_T$ is the input to the output
layer for the treated population. The output layer is a ridge
linear regression that generates the final outcome. From the
definition of the residual function $\Delta f$, we see that $\Delta f(x)$
is the estimated treatment effect for participant $x$, which
is the quantity of interest in a control and treated experiment.
With the residual block directly connected to fc2, the residual
function $\Delta f(x)$ depends on the feature representation
of participant $x$.


We plug in the residual block (shown in Figure 1) between
the fc2 layer and the final output layer for the treated
population in order to estimate the ITE. No residual block is
plugged in between the fc2 layer and the final output layer for
the control population. The final output layer $g(\cdot)$ is a
linear regression that calculates the predicted outcome, such that

$$y_C = g(f_C(x)), \quad \text{and} \quad y_T = g(f_T(x)).$$


Recall from the problem setup described above that there exist
$n$ samples $\{(x_i, t_i, y_i^F)\}_{i=1}^n$, where
$y_i^F = t_i \cdot Y_1(x_i) + (1 - t_i) Y_0(x_i)$. In the control and the
treated setting, we assume that $n_c$ ($n_c > 0$) samples
$\{(x_i, 0, y_i^F)\}_{i=1}^{n_c} \sim D_c$ are assigned to the control
($t = 0$), and $n_t$ ($n_t > 0$) samples
$\{(x_i, 1, y_i^F)\}_{i=1}^{n_t} \sim D_t$ are assigned to the treated
($t = 1$), such that $n = n_c + n_t$. As described above, RCN is an
integration of deep feature learning, feature representation
balancing, and treatment effect estimation in an end-to-end
fashion, with the following loss function:


$$\min_{f_T = f_C + \Delta f(f_C)} \; \frac{1}{n_c} \sum_{i=1}^{n_c} L(f_C(x_i), y_i^F) + \frac{1}{n_t} \sum_{i=1}^{n_t} L(f_T(x_i), y_i^F) + \lambda \cdot \mathrm{IPM}(D_c, D_t),$$


where $\lambda$ is the tradeoff parameter for the IPM penalty and $L$ is
the loss function of the model. In the case of binary
classification, $L$ is the standard cross entropy. In the case of
regression, $L$ is the root-mean-square error (RMSE). During
training, the model has access only to the factual
outcome.
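
The structure of this objective (a factual loss on the control samples, a factual loss on the treated samples, and a weighted IPM penalty) can be sketched as below. The function `rcn_objective` and its argument layout are hypothetical; the predictions and the IPM value are assumed to be computed elsewhere, and cross entropy is used as the default loss as in the binary classification case.

```python
import math

def cross_entropy(p, y, eps=1e-12):
    """Binary cross entropy for one predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def rcn_objective(control_preds, control_y, treated_preds, treated_y,
                  ipm_value, lam, loss=cross_entropy):
    """Mean factual loss per group plus lam times the IPM between the groups."""
    control_term = sum(loss(p, y) for p, y in zip(control_preds, control_y)) / len(control_y)
    treated_term = sum(loss(p, y) for p, y in zip(treated_preds, treated_y)) / len(treated_y)
    return control_term + treated_term + lam * ipm_value
```

Setting `lam` to zero removes the balancing penalty, so the remaining two terms fit the factual outcomes only.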


4. RELATED WORK 


From a conceptual point of view, our work is inspired by
the work on domain adaptation and deep residual learning.
[6] proposed the Residual Transfer Network, which adopts the
MMD distance to learn transferable deep features from labeled
data in the source domain and unlabeled data in the
target domain, and adds a residual block to transfer the
prediction classifier from the target domain to the source
domain. The structure of our model is similar to that of their
model. Deep residual learning was introduced by [2], the winner
of the ImageNet ILSVRC 2015 challenge, to ease the
training of deep networks. The residual block is designed to
learn residual functions $\Delta F(x)$ with reference to the layer
input $x$. Reformulating layers as residual blocks makes
training easier than directly learning the original
functions $F(x) = \Delta F(x) + x$.


Our model extends the work of [5, 13], where the authors
build a connection between domain adaptation and counterfactual
inference. They use IPMs, such as the MMD and Wasserstein
distances, to learn a representation of the data which
balances the control and treated distributions. The treatment
assignment is concatenated with the representation to
predict the factual outcome, while the reverse treatment
assignment is concatenated with the representation to predict
the counterfactual outcome. Compared to their work,
we add a residual block to estimate the individual treatment
effect based on the representation. [17, 1] proposed random
causal forests (RCF), which build upon the idea of random
forests to estimate the heterogeneous treatment effect.


5. EXPERIMENTS 


5.1 Evaluation Metrics 
To compare the various models, we report the RMSE of the
estimated individual treatment effect, denoted

$$\epsilon_{ITE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \big( (Y_1(x_i) - Y_0(x_i)) - \widehat{ITE}(x_i) \big)^2},$$

and the absolute error in average treatment effect

$$\epsilon_{ATE} = \left| \frac{1}{n} \sum_{i=1}^{n} (f_1(x_i) - f_0(x_i)) - \frac{1}{n} \sum_{i=1}^{n} (Y_1(x_i) - Y_0(x_i)) \right|.$$

Following [4, 5], we report the Precision in Estimation of
Heterogeneous Effect (PEHE),

$$\mathrm{PEHE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \big( (Y_1(x_i) - Y_0(x_i)) - (f_1(x_i) - f_0(x_i)) \big)^2}.$$

Whereas a small RMSE of the estimated ITE requires accurate
estimation of the counterfactual responses, a good (small)
PEHE requires accurate estimation of both factual and
counterfactual responses.
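
The three metrics can be computed as in the following sketch, which is usable only when both potential outcomes are known (as in semi-simulated data). It assumes that $\epsilon_{ITE}$ and PEHE are both root-mean-square quantities; the function names and list-based inputs are illustrative choices.

```python
import math

def _mean(values):
    values = list(values)
    return sum(values) / len(values)

def rmse_ite(y1, y0, ite_hat):
    """RMSE between the true ITE (y1 - y0) and an estimated ITE."""
    return math.sqrt(_mean(((a - b) - e) ** 2 for a, b, e in zip(y1, y0, ite_hat)))

def abs_error_ate(y1, y0, f1, f0):
    """Absolute error between the estimated and true average treatment effect."""
    return abs(_mean(p - q for p, q in zip(f1, f0)) - _mean(a - b for a, b in zip(y1, y0)))

def pehe(y1, y0, f1, f0):
    """Root-mean-square gap between true and predicted effects (assumed RMS form)."""
    return math.sqrt(_mean(((a - b) - (p - q)) ** 2
                           for a, b, p, q in zip(y1, y0, f1, f0)))
```

Here `f1` and `f0` hold the model's predicted outcomes under treatment and control for each participant, so PEHE penalizes errors in both predictions.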


However, calculating $\epsilon_{ITE}$, $\epsilon_{ATE}$, and PEHE requires the
“ground truth” of the ITE for each participant in the
experiment. We cannot gather the counterfactual outcomes
from RCTs and observational studies, and thus do not have
the ITE of each participant, so we cannot evaluate $\epsilon_{ITE}$ and
PEHE on these datasets. In order to evaluate the performance
on these datasets across various models, we use a
measure called policy risk, introduced by [13]. Given a
model $f$, the participant $x$ is assigned to the treatment,
$\pi_f(x) = 1$, if $f(x, 1) - f(x, 0) > \lambda$ (in the case of RCN,
$\Delta f > \lambda$), where $\lambda$ is the treatment threshold, and to the
control, $\pi_f(x) = 0$, otherwise. The policy risk is defined as:

$$R_{Pol}(\pi_f) = 1 - \big( E[Y_1 \mid \pi_f(x) = 1] \cdot p(\pi_f = 1) + E[Y_0 \mid \pi_f(x) = 0] \cdot p(\pi_f = 0) \big).$$


Figure 2: CFR for ITE estimation. L is a loss function,
IPM is an integral probability metric.


The empirical estimator of the policy risk on a dataset is
calculated by:

$$R_{Pol}(\pi_f) = 1 - \big( E[Y_1 \mid \pi_f(x) = 1, t = 1] \cdot p(\pi_f = 1) + E[Y_0 \mid \pi_f(x) = 0, t = 0] \cdot p(\pi_f = 0) \big).$$


To obtain the policy risk, we use the method introduced by 
[16]. We select a subset of participants in the dataset where 
the treatment recommendation inferred by the model is the 
same as the treatment assignment in the experiment and 
then calculate the average loss from the subset of the data 
(see Table 1 for illustrative data). 
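
The subset-based calculation can be sketched as follows, using the four hypothetical students of Table 1 as input. Each record is (assigned treatment, observed completion, recommended treatment); the record layout, the function name, and the choice to let an empty congruent subset contribute 0 are assumptions made for this example.

```python
def policy_risk(records):
    """Empirical policy risk from (t, y, pi) records of a randomized experiment.

    t  = treatment actually assigned (0 or 1)
    y  = observed outcome
    pi = treatment recommended by the model (0 or 1)
    """
    n = len(records)

    def congruent_mean(arm):
        # Outcomes of participants whose assignment matches the recommendation.
        ys = [y for t, y, pi in records if t == arm and pi == arm]
        return sum(ys) / len(ys) if ys else 0.0  # assumption: empty subset counts as 0

    p_treat = sum(1 for _, _, pi in records if pi == 1) / n
    return 1.0 - (congruent_mean(1) * p_treat + congruent_mean(0) * (1.0 - p_treat))

# Hypothetical students 1-4 from Table 1: (group, completion, "Treat?")
table1 = [(0, 1, 1), (0, 0, 0), (1, 0, 1), (1, 1, 0)]
```

For these four rows only students 2 and 3 are congruent, and neither completed the assignment, so the sketch yields a policy risk of 1.0 for this toy input.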


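For the datasets without the “ground truth” on ITE, we
also calculate the average treatment effect on the treated by

$$ATT = \frac{1}{n_t} \sum_{i: t_i = 1} y_i^F - \frac{1}{n_c} \sum_{i: t_i = 0} y_i^F,$$

and report the error on ATT as

$$\epsilon_{ATT} = \left| ATT - \frac{1}{n_t} \sum_{i: t_i = 1} (f_1(x_i) - f_0(x_i)) \right|.$$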


5.2 Baselines 

Balancing Neural Networks (BNN) is a neural-network-based
model for counterfactual inference. Compared to RCN,
it has exactly the same fc1 and fc2 layers with an IPM
regularizer to learn the representation $\Phi(x)$ of the participant
$x$. However, instead of using a residual block to estimate
the treatment effect, it concatenates the treatment assignment
$t_i$ to the output of the fc2 layer, $\Phi(x_i)$, and feeds
$[\Phi(x_i), t_i]$ to another two fully connected layers to generate
the predicted outcome. We refer to this particular structure of
BNN as BNN-2-2, following [5].


The Counterfactual Regression (CFR) [13] is built on the 
BNN. The important difference between these two models 
is that the CFR uses a more powerful distribution metric in 
the form of IPMs to learn a balancing representation. We 
compare our model with BNN-2-2 and CFR to verify the 
efficacy of residual block in terms of estimating individual 
treatment effect. 


We introduce a simple neural network baseline model to
evaluate the efficacy of the IPM regularizer and residual
mapping. This baseline model is a feed-forward neural
network with four hidden layers, trained to predict the
factual outcome based on $x$ and $t$, without the IPM
regularizer and the residual block. We refer to this as NN-4.


Table 1: Hypothetical data for some example students. The predicted outcome is the probability that the
student would complete the assignment. Students marked with * are those whose randomized treatment assignment
is congruent with the recommendation of the counterfactual inference model. Data from these students would
be used to calculate the policy risk.

ID | Group     | Completion | Predicted outcome if treated | Predicted outcome if not treated | Predicted effect | Treat?
1  | Control   | 1          | 0.8                          | 0.75                             | 0.05             | 1
2* | Control   | 0          | 0.3                          | 0.45                             | -0.15            | 0
3* | Treatment | 0          | 0.50                         | 0.38                             | 0.12             | 1
4  | Treatment | 1          | 0.91                         | 0.99                             | -0.08            | 0

5.3 Simulation based on real data - IHDP
The Infant Health and Development Program (IHDP) dataset
is a semi-simulated dataset introduced by [4]. The dataset
consists of a number of covariates from a real randomized
experiment. The goal of the experiment was to study the
impact of superior child care and home visits on future
cognitive test scores. [4] discarded a biased subset of the treated
population in order to introduce imbalance between treated
and control subjects and used a simulated counterfactual
outcome. The resulting dataset contains 747 subjects (139 treated,
608 control), each represented by 25 covariates assessing the
attributes of the children and their mothers.


Table 2: Results of IHDP

Model   | $\epsilon_{ITE}$ | $\epsilon_{ATE}$ | PEHE
NN-4    | 2.0 | 0.5  | ee
BNN-2-2 | 1.7 | 0.3  | 10
CFR     | 1.4 | 0.2  | 1.6
RCN     | 1.1 | 0.05 | 1.4


5.4 ASSISTments dataset


The ASSISTments online learning platform [3] is a free web- 
based platform utilized by a large user-base of teachers and 
students. The platform has been the subject of a recent 
study within the state of Maine [9], demonstrating signif- 
icant learning gains for students using the platform. The 
dataset used in this work comes from one of 22 random- 
ized controlled experiments [12] collected within the plat- 
form. This experiment was run in assignment types known 
as “skill builders” in which students are given problems until
a threshold of understanding is reached; within ASSIST- 
ments, this threshold is traditionally three consecutive cor- 
rect responses. Reaching this threshold denotes sufficient 
performance and completion of the assignment. In addi- 
tion to this experimental data, information of the students 
prior to condition assignment is also provided in the form of 
problem-level log data providing a breadth of student infor- 
mation at fine levels of granularity. 


In this experiment, there are two kinds of hints (video versus
text) available for each problem in the assignment when
students answer the problem incorrectly. The assignment
to the video hint or the text hint was random. Video
content was designed to mirror the text hints in an attempt to
provide identical assistance. There are 147 students who
received the video hint and 237 students who received the
text hint. The dataset includes 15 covariates, such as
student past-performance history and class past-performance
history. We solve a binary classification task, which is to predict
the completion of the assignment for each student.


6. RESULTS 

The results on IHDP are presented in Table 2, with the
treatment threshold $\lambda = 0$. We see that our proposed RCN
performs the best on the dataset in terms of estimating ITE,
ATE and PEHE. There is an especially large improvement
in estimating ITE. These results indicate that the residual
block $\Delta f(x)$ helps accurately predict the value of the ITE based
on the feature representation $\Phi(x)$ for a given participant $x$.


The results on the ASSISTments dataset are of primary interest
in our work, since we hope to apply the RCN to educational
experiments in order to support decision making in terms of
personalized learning. The results in terms of policy risk and
the average treatment effect on the treated are shown in
Table 3, with the treatment threshold $\lambda = 0$. The model TA
means “Treated All,” where all students are assigned to the
treatment, while the model NT means “Not Treated,” where
all students are assigned to the control. Without considering
that the effects of an intervention may differ for individual
students, the model with the better performance out of these
two would be adopted when a choice must be made
between these two interventions. The RCN, which considers
the individual treatment effect, outperforms the TA and
the NT. This indicates that taking the individual effect into
account helps make a better choice of interventions. The
comparison between the CFR and the RCN suggests that
the RCN performs better than the CFR in terms of
policy risk and ATT.


To investigate the relationship between policy risk and the
treatment threshold λ, we plot policy risk as a function of λ
in Figure 3. For the results of the ASSISTments dataset from
the CFR, the maximum predicted ITE in the dataset is 0.44. Once
the threshold λ is larger than 0.44, the CFR is converted to
“Not Treated”, where all students are assigned to the control.
Since the maximum predicted ITE in the ASSISTments dataset from
the CFR is 0.18, the CFR is converted to “Not Treated” once the
treatment threshold λ is larger than 0.18.
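
The thresholding behavior described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the policy treats a student when the predicted ITE exceeds λ, and it assumes the policy-risk estimator commonly used with randomized data in the counterfactual regression literature, 1 - (E[Y1 | treat recommended] * p(treat) + E[Y0 | control recommended] * p(control)). The function name and the toy data are hypothetical.

```python
def policy_risk(ite_hat, treated, outcome, lam=0.0):
    """Estimate policy risk on randomized data for 'treat if ITE > lam'.

    ite_hat: predicted individual treatment effects
    treated: 1 if the student was randomized to treatment, else 0
    outcome: observed outcome in [0, 1] (e.g. 1 = completed assignment)
    """
    recommend = [e > lam for e in ite_hat]
    # Outcomes of students whose random assignment matches the recommendation.
    treat_match = [y for r, t, y in zip(recommend, treated, outcome) if r and t]
    ctrl_match = [y for r, t, y in zip(recommend, treated, outcome)
                  if not r and not t]
    p_treat = sum(recommend) / len(recommend)
    # If a matched group is empty its mean is set to 0.0; the estimate is
    # only trustworthy when that group's probability is also 0.
    mean_t = sum(treat_match) / len(treat_match) if treat_match else 0.0
    mean_c = sum(ctrl_match) / len(ctrl_match) if ctrl_match else 0.0
    return 1.0 - (mean_t * p_treat + mean_c * (1.0 - p_treat))


# Toy data (hypothetical): four students, two per arm.
ite_hat = [0.3, 0.1, -0.2, 0.05]
treated = [1, 0, 1, 0]
outcome = [1, 0, 0, 1]

# Above the largest predicted ITE nobody is treated, so the policy
# coincides with "Not Treated".
above_max = policy_risk(ite_hat, treated, outcome, lam=0.5)
not_treated = policy_risk([-1.0] * 4, treated, outcome, lam=0.0)
print(above_max == not_treated)  # True
```

Sweeping `lam` from 0 to 1 and recording the returned value would produce a curve of the kind shown in Figure 3, flat beyond the maximum predicted ITE.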


Figure 3: Treatment threshold versus policy risk on the
ASSISTments dataset (curves for RCN and CFR; x-axis: treatment
threshold, 0.0 to 1.0; y-axis: policy risk). Lower policy risk
is better.


Table 3: Results of the ASSISTments Dataset

Model   R_pol   ε_ATT
TA      0.14    -
NT      0.27    -
CFR     0.14    0.08
RCN     0.08    0.03


7. CONCLUSION


As online educational experiments become popular and easy to
conduct, and machine learning becomes a major tool for
researchers, counterfactual inference is gaining interest for
the purpose of personalized learning. In this paper we propose
the Residual Counterfactual Networks (RCN) to estimate the
individual treatment effect. Because of the dissimilarity
between the distributions of the control and the treated
populations, the RCN uses IPMs, such as the Wasserstein and MMD
distances, to learn balancing deep features from the data. A
residual block is applied to the deep features to learn the
individual treatment effect (ITE), so that estimation of the
ITE depends on the deep features. We apply our model to both
synthetic datasets and real-world datasets from online
educational experiments, and the results indicate that our
model achieves state-of-the-art performance.


One open question for future work is how to generalize our
model to situations where there is more than one treatment in
the experiment. An Integral Probability Metric (IPM) can only
measure the distance between two distributions. We could use
pairwise IPMs if there are more than two distributions, but
this would become computationally expensive as the number of
distributions increases. Since running experiments is expensive
and collecting enough data for the model to make a reliable
prediction is difficult, we also need a better optimization
algorithm that allows us to train the model efficiently.
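
To make the pairwise cost concrete, here is a minimal sketch that uses the linear-kernel MMD (the Euclidean distance between group means) as a stand-in IPM and evaluates it for every pair of treatment arms. The function names, the arm labels, and the choice of the linear kernel are assumptions made for illustration; the RCN itself may use a different IPM and estimator.

```python
import math
from itertools import combinations


def linear_mmd(a, b):
    """Linear-kernel MMD: distance between the mean feature vectors."""
    dim = len(a[0])
    mean_a = [sum(row[d] for row in a) / len(a) for d in range(dim)]
    mean_b = [sum(row[d] for row in b) / len(b) for d in range(dim)]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(mean_a, mean_b)))


def pairwise_ipm(groups):
    """Compute the IPM for every unordered pair of groups.

    groups: dict mapping an arm label to a list of feature vectors.
    With k arms this evaluates k * (k - 1) / 2 distances.
    """
    return {(i, j): linear_mmd(groups[i], groups[j])
            for i, j in combinations(sorted(groups), 2)}


# Hypothetical three-arm experiment with 2-dimensional features.
arms = {
    "control": [[0.0, 0.0]],
    "text": [[3.0, 4.0]],
    "video": [[0.0, 1.0]],
}
distances = pairwise_ipm(arms)
print(len(distances))                   # 3 pairs for 3 arms
print(distances[("control", "text")])   # 5.0
```

The number of pairs grows quadratically (3 arms give 3 pairs, 10 arms give 45), which is the scaling concern raised above.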
ABSTRACT


Student persistence in online learning environments has typically
been studied at the macro-level (e.g., completion of an online
course, number of academic terms completed, etc.). The current
study examines student persistence in an adaptive learning
environment, ALEKS (Assessment and LEarning in Knowledge Spaces).
Specifically, the study explores the relationship between students'
academic achievement and their persistence during learning. Using
archived data that included students' math learning log data and
performance on two standardized tests, we first explored student
learning behavior patterns with regard to their persistence during
learning. Clustering analysis identified three distinctive patterns
of persistence-related learning behaviors: (1) High persistence and
rare topic shifting; (2) Low persistence and frequent topic
shifting; and (3) Moderate persistence and moderate topic
shifting. We further explored the association between persistence
and academic achievement. No significant differences in academic
achievement were observed between the different learning patterns.
We interpret this result, together with a preliminary exploration
of topic mastery trends, to suggest that "wheel-spinning" behaviors
coexist with persistence and are ultimately not beneficial to
learning.


Keywords 


ALEKS, persistence, academic achievement 


1. INTRODUCTION 


Assessment and LEarning in Knowledge Spaces (ALEKS) is an online
adaptive learning system based on Knowledge Space Theory [8].
According to Knowledge Space Theory, a knowledge domain is
represented by a finite set of concepts. The knowledge state of a
student in a domain can be represented by a particular subset of
concepts that the student is capable of mastering. By gauging a
learner’s knowledge state, ALEKS determines what a student knows
and is ready to learn, and provides personalized learning paths
that are ideal for each student [3]. When a learner first uses
ALEKS, the system starts with an individualized initial assessment
to find the student’s knowledge state. The assessment usually
consists of 20 to 30 problems (out of more than 600 problems).
After the initial assessment, the student receives a report in a
color-keyed pie chart (as shown in Fig. 1). Each "slice" of the pie
chart corresponds to a particular area of the syllabus, and the
darker shades of color indicate how much the student has mastered
in that area [1]. After the first assessment, ALEKS identifies the
student’s knowledge state and generates a list of topics the
student is ready to learn in each area. Once a student chooses the
area and topic he/she wants to work on, ALEKS provides a set of
problems, and the student learns by solving problems under a
specific topic. After the student successfully solves problems
covering the same topic, the system determines that the student
has mastered the topic and adds the topic to the student’s
knowledge pie, and the student can then move on to a new topic [2].

Figure 1: ALEKS knowledge pie showing the number of concepts the
learner has learned and needs to learn. Slices shown: Algebra
(46 of 71); Whole Numbers (76 of 80); Proportions, Percents, and
Probability (13 of 33); Fractions and Decimals (52 of 66);
Measurement and Graphs (32 of 48); Geometry (36 of 72).


As one of the popular adaptive learning systems, ALEKS has been
evaluated in several empirical studies carried out in different
settings, and was observed to be effective in most of them
[6, 9, 12, 13, 16, 19]. These studies generally measured ALEKS
students’ learning gains or academic achievements; however, none
of them looked at students’ learning process or online learning
behaviors. In this study, we explored students’ offline learning
outcomes and online learning behavior patterns, and investigated
whether persistence was associated with academic achievement in an
individualized online learning environment. We further examined
students’ wheel-spinning behaviors [5] in order to understand the
association.


2. RELATED WORK 


In this section, we introduce how persistence has been studied in
different learning contexts (traditional classroom environments
and online learning environments), and how the relationship
between persistence and academic achievement has been
investigated. Persistence is “the quality that allows someone to
continue doing something or trying to do something even though
it is difficult or opposed by other people” [15]. According to
Rovai, persistence is the behavior of continuing action despite the
presence of obstacles [22]. Persistence in the face of adversity is
often described as a result of high motivation. For instance, in the
literature investigating classroom learning, persistence was
typically examined as an outcome factor of motivation. Elliot and
his colleagues [7] found mastery goals and performance approach
goals were positive predictors of persistence; Vansteenkiste et al.
[24] found intrinsic motivation improved student persistence;
Multon et al. [18] found that self-efficacy facilitated persistence.
Although the concept of persistence has been studied in different
bodies of literature, it has been operationalized in various ways.
For example, in their meta-analysis, Multon and his colleagues [18]
summarized three ways of operationalizing persistence after
reviewing eighteen studies: time spent on task, number of items or
tasks attempted or completed, and number of academic terms
completed. Apart from these three commonly used measures,
persistence was also frequently measured with self-reports [4, 7,
27].


In the context of online learning environments, persistence was
usually defined as the completion of an online course, or as an
antonym of attrition [10, 14, 20, 22]. Persistent learners, who were
referred to as “completers”, were the learners who successfully
completed an online course. Non-persistent learners, who were
referred to as “dropouts”, were the learners who did not finish a
course [10, 14]. Persistence was mainly explored as a dependent
variable affected by psychological and social factors, such as
self-motivation, engagement, economic support, etc. [14].
Persistence was also investigated as a consequence correlated with
online behaviors such as participation, discussion, etc. [17, 21].


Despite various studies on persistence in learning, persistence was
rarely studied as a predictive factor. Stekel and Tobias [23]
hypothesized a curvilinear relationship between self-estimated
persistence and achievement. They predicted a moderate amount
of persistence would lead to the highest achievement. They also
hypothesized that persistence would be positively related to
achievement in a lecture-related instructional environment, but
unrelated in an individualized instructional environment.
However, they failed to confirm their hypotheses. While examining
the mediation effect of persistence on the relationship between
goals and academic achievement, Elliot et al. [7] found that
self-reported persistence was a positive predictor of exam
performance in a lecture-based classroom setting. This supported
one of Stekel and Tobias’ hypotheses. For an online learning system
like ALEKS, the instructional context could be considered
individualized because ALEKS models a student’s knowledge state
and always provides the concepts the student is ready to learn.
Therefore, we ask whether persistence is unrelated to academic
achievement in an individualized learning environment like ALEKS.


3. METHODS 
3.1 DATA SETS 


The data sets used for this study were collected from Jackson- 
Madison Intelligent Tutoring System Evaluation (JMITSE) 
program. JMITSE was an after-school program applied in five 
middle schools in Jackson-Madison County School System of 
Tennessee from 2009 to 2012. The goal of JMITSE program was 
to investigate whether technology outperformed human teachers 
in math teaching. There were two experimental conditions: 
teacher condition and technology condition. In the teacher 
condition, students learned math with math teachers in the after- 
school program. In the technology condition, students learned 
math with ALEKS. For this study, we only used data from the
ALEKS condition. The program lasted for three academic years
and 366 sixth-graders were assigned to the ALEKS condition
altogether. Participants were supposed to study for two one-hour
sessions every week, for twenty-five weeks. Logs of all students’
online learning activities were recorded by the system. The
ALEKS log file included students’ online ID, the topics (i.e.,
concepts) students attempted, learning mode (i.e., learning or
review), time elapsed and the result of each attempt. For each
attempt, there are five possible results: correct, wrong, explain, 
added to pie and failed. “Correct” is shown after a learner 
attempts a task and gets the correct answer. “Wrong” is shown 
after a learner attempts a task and gets a wrong answer. After a 
learner gets a wrong answer, two buttons “Try” and “Explain” 
will be shown to the learner. If the learner hits the “Try” button, 
he/she will be given another problem to work on. If the learner 
hits the “Explain” button, a worked example of that problem will 
be provided (as shown in Fig. 2). Reading an explanation is 
regarded as an attempt and the result is recorded as “Explain.” 
“Added to Pie” is shown after a learner attempts a problem
correctly. The difference between “Added to Pie” and “Correct” is 
that “Correct” is based on one single attempt, but “Added to Pie” 
is based on multiple correct attempts. When a learner can 
correctly answer problems under a concept consistently, ALEKS 
decides the learner has mastered the concept and adds the concept 
to the learner’s knowledge pie. After being added to the 
knowledge pie, that topic will not be given to the learner again, 
except for reviewing. “Failed” is shown after a learner attempts a
task and answers incorrectly. Similar to “Added to Pie”, it is not
based on one single attempt; instead, it is recorded when
there are multiple unsuccessful attempts and the system decides
that the learner failed to learn that topic.


The participants of JMITSE took the Tennessee Comprehensive
Assessment Program (TCAP), which is a standardized test, twice.
Before entering the program, the students took TCAP5, which 
was TCAP for 5th graders. After finishing the program, the 
students took TCAP6, which was TCAP for 6th graders. The two 
tests were used as pretest and posttest in the analysis. 


3.2 DATA PROCESS 


The log file used in this study contains 330,319 lines of online
learning sequences from 366 students. Each line represents an
attempt from a student on one topic. Most students attempted
multiple topics, and most topics were attempted multiple times.
Therefore, for each student, there were multiple rows of data.
First, the data were aggregated at the topic level. After
aggregation, the number of observations for each student equaled
the number of topics they attempted. For each topic attempted by a
student, we computed the number of attempts and amount of time
spent on the topic, as well as whether it was mastered. We named
the variables “Attempts”, “Time” and “Master”. The Pearson
product-moment correlation coefficient indicated that “Attempts”
and “Time” were highly correlated (r = .98). To determine which
variable to use as the measure of effort, we further examined the
distributions of the two variables, which revealed that neither
was normally distributed. However, after log transformation,
“Attempts” became approximately normally distributed, but “Time”
was still skewed (as shown in Fig. 2). Therefore, “Attempts” was
chosen to measure students’ effort on task.


Figure 2: Distribution of log-transformed attempts and
log-transformed time on each topic


Next, we created three dummy-coded variables as measures of
persistence: “High persistence”, “Moderate persistence” and
“Switch”. While “High persistence” and “Moderate persistence” were
used to describe different levels of persistent learning
behaviors, “Switch” was used to describe non-persistent behaviors
when a student gave up a topic quickly and switched to a new topic
before mastery. For a topic, if its log-transformed attempts were
in the fourth quartile of the distribution, “High persistence” was
coded 1; otherwise it was coded 0. If its log-transformed attempts
fell into the second or third quartile of the distribution,
“Moderate persistence” was coded 1; otherwise it was coded 0. For
“Switch”, both attempts and the result were taken into account. If
a topic’s log-transformed attempts were in the first quartile of
the distribution and the topic was not mastered, “Switch” was
coded 1; otherwise it was coded 0.


After the new variables were created and coded, the 51,982 rows of
data were aggregated to the student level by averaging the
persistence variables, which yielded 366 observations. After this
second aggregation, the three persistence variables became
continuous rather than binary. These variables represent the
proportion of topics on which a student persisted at each level.
For instance, if a student gets 0.2 in “High persistence”, it means
that the student attempted twenty percent of the topics with high
persistence. Lastly, we computed the number of topics each student
attempted for data screening. Because the three persistence
variables were proportions of the topics attempted, a high value
did not necessarily imply a behavior pattern if the total number
of topics attempted by the student was too small. Therefore, we
decided to screen out the students who attempted only a small
number of topics. Based on the distribution of topics attempted by
each student, the students whose number of attempted topics fell
within the first quartile (Topics <= 61) were excluded from
further analysis. There were 275 observations after screening.
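
The two aggregation steps above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the code used in the study: the row layout `(student_id, topic, result)`, the label `"added to pie"`, the function name, and the handling of values that fall exactly on a quartile boundary are all hypothetical choices.

```python
import math
from statistics import quantiles


def persistence_profile(log_rows, mastered_label="added to pie"):
    """Turn attempt-level log rows into student-level persistence proportions.

    log_rows: iterable of (student_id, topic, result), one row per attempt.
    Returns {student_id: {"high", "moderate", "switch", "topics"}}.
    """
    attempts, mastered = {}, {}
    for student, topic, result in log_rows:  # step 1: topic-level aggregation
        key = (student, topic)
        attempts[key] = attempts.get(key, 0) + 1
        mastered[key] = mastered.get(key, False) or result == mastered_label

    # Log-transform attempts; quartile cut points over all student-topic rows.
    log_att = {key: math.log(n) for key, n in attempts.items()}
    q1, _median, q3 = quantiles(log_att.values(), n=4)

    counts = {}
    for key, value in log_att.items():       # dummy coding per topic
        row = counts.setdefault(
            key[0], {"high": 0, "moderate": 0, "switch": 0, "topics": 0})
        row["topics"] += 1
        if value > q3:                        # fourth quartile
            row["high"] += 1
        elif value > q1:                      # second or third quartile
            row["moderate"] += 1
        elif not mastered[key]:               # first quartile, not mastered
            row["switch"] += 1

    # Step 2: student-level aggregation (mean of each dummy variable).
    return {
        student: {
            "high": row["high"] / row["topics"],
            "moderate": row["moderate"] / row["topics"],
            "switch": row["switch"] / row["topics"],
            "topics": row["topics"],
        }
        for student, row in counts.items()
    }


def screen(profile, min_topics=61):
    """Drop students who attempted min_topics or fewer topics."""
    return {s: p for s, p in profile.items() if p["topics"] > min_topics}
```

A topic in the first quartile that was mastered is counted in none of the three variables, which matches the coding rule for “Switch” described above; this is why the three proportions need not sum to one.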


After data processing, we conducted cluster analysis to explore
students’ persistence-related learning patterns. We performed an
analysis of covariance to compare the academic achievement of
students from different groups and thereby explore the association
between online behavior and academic achievement. We also
conducted an analysis of variance to compare the mastered topics
between groups to better understand the association.


4. RESULTS 
4.1 CLUSTER ANALYSIS 


There is no strictly defined sample size for cluster analysis.
According to the suggestion of Formann [11], the minimal sample
size should be no less than 2^k cases (k = number of variables),
preferably 5*2^k. With k = 3, this gives 8 and 40 cases,
respectively. The study examined the clustering of 275
observations across three variables, which comfortably exceeded
this minimum. Ward’s [25] hierarchical clustering technique
was applied and the squared Euclidean distance was used to
calculate the distance between clusters. A scree plot was used to
determine the optimum number of clusters, where the levelling-off
point indicated a reduced variability between clusters after it
[26]. Examination of the scree plot revealed flattening between
three and four clusters, indicating that a three-cluster solution
best captured the similarities and differences between students on
the three variables. The cluster membership did not change when
the analysis was repeated, and ANOVAs on the clustering variables
found significant differences between clusters, which further
confirmed the quality of the solution. The three-cluster solution
is shown in Fig. 3. The scales are the percentage of topics
students attempted with a specific behavior. For example, the y
axis of the top row is the percentage of switch behavior. The x
axis of the top middle block is the percentage of moderate
persistent learning behavior, and the x axis of the top right
block is the percentage of high persistent learning behavior. From
the top middle block, we can see that the clusters are more
distinct on switch behavior (i.e., y axis), whereas on the
moderate persistence behavior (i.e., x axis) there is more overlap
between the student clusters. From the top right block, we can see
that the black cluster has more high persistent learning behavior,
and the green and red clusters have more overlap. The descriptive
statistics on the grouping variables and the academic achievement
variables that we further explored are shown in Table 1.
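
As a rough illustration of Ward's criterion, the sketch below repeatedly merges the two clusters whose union least increases the within-cluster sum of squares, which for clusters of sizes n_a and n_b equals n_a * n_b / (n_a + n_b) times the squared Euclidean distance between their centroids. It is a naive, cubic-time toy written for readability, with a hypothetical function name and made-up data; an analysis at the scale described here would normally rely on a statistics package such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"`.

```python
def ward_clusters(points, k):
    """Naive Ward agglomeration down to k clusters.

    points: list of equal-length numeric tuples (one per student).
    Returns a list of clusters, each a sorted list of point indices.
    """
    clusters = [{"members": [i], "centroid": list(p)}
                for i, p in enumerate(points)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca, cb = clusters[a], clusters[b]
                na, nb = len(ca["members"]), len(cb["members"])
                sq_dist = sum((x - y) ** 2
                              for x, y in zip(ca["centroid"], cb["centroid"]))
                # Increase in within-cluster sum of squares if a and b merge.
                cost = na * nb / (na + nb) * sq_dist
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        ca, cb = clusters[a], clusters[b]
        na, nb = len(ca["members"]), len(cb["members"])
        merged = {
            "members": ca["members"] + cb["members"],
            "centroid": [(na * x + nb * y) / (na + nb)
                         for x, y in zip(ca["centroid"], cb["centroid"])],
        }
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return [sorted(c["members"]) for c in clusters]


# Made-up (switch, moderate, high) proportions for four students.
students = [(0.16, 0.34, 0.31), (0.17, 0.33, 0.30),
            (0.36, 0.28, 0.19), (0.35, 0.29, 0.20)]
print(sorted(ward_clusters(students, 2)))  # [[0, 1], [2, 3]]
```

Recording the merge cost at each step and plotting it against the number of clusters gives the kind of scree plot used to choose the three-cluster solution.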


Figure 3: Scatterplot matrices of three-level persistence of 
three clusters 
Cluster 1: High persistence, low switch


Cluster 1 (i.e., the black cluster in Fig. 3) accounts for 37.5% of
the study sample (n=103). The students in this cluster switched
topics less than members of the other two clusters. The switching
ratio of Cluster 1 is 0.16, which indicates that students quickly
gave up or switched to other topics before mastery for 16% of the
tasks they attempted. For 34% of the tasks, the students worked
with moderate persistence (i.e., attempted the task 3-7 times),
and for 31% of the tasks, the students worked with high
persistence (i.e., attempted the task 8 or more times). These
students did not easily give up on tasks, and put a large amount of
effort into about one third of the tasks they were given, which
indicated that they were persistent learners.


Table 1: Mean scores and standard deviations for each
variable by cluster

Variable               Cluster 1 (n = 103)   Cluster 2 (n = 54)   Cluster 3 (n = 118)
Switch                 0.16 (σ=0.05)         0.36 (σ=0.05)        0.23 (σ=0.05)
Moderate persistence   0.34 (σ=0.05)         0.28 (σ=0.05)        0.34 (σ=0.05)
High persistence       0.31 (σ=0.07)         0.19 (σ=0.07)        0.18 (σ=0.04)
TCAP5                  46.72 (σ=18.25)       39.37 (σ=17.60)      47.28 (σ=17.23)
TCAP6                  43.23 (σ=20.89)       32.69 (σ=18.44)      40.49 (σ=21.63)


Cluster 2: Low persistence, high switch 


Cluster 2 (i.e., the red cluster in Fig. 3) is a comparatively
smaller cluster including 19.6% (n=54) of the study sample. The
distinctive characteristic of this cluster is its high switching
ratio. For 36% of the tasks they were given, the learners quickly
gave up or switched to new tasks before mastering them. The
students worked with moderate persistence (i.e., attempted the task
3-7 times) on 28% of the tasks, and with high persistence (i.e.,
attempted the task 8 or more times) on 19% of the tasks. Compared
with the other two clusters, the students in this cluster were not
very persistent. Although they worked on some tasks with multiple
attempts, they gave up on a large percentage of the tasks, and
they were not willing to put much effort into a task.


Cluster 3: Moderate persistence, moderate switch 


Cluster 3 (i.e., the green cluster in Fig. 3) is the largest
cluster, with 118 students representing 42.9% of the study sample.
The students in this cluster switched topics on 23% of the tasks,
which is higher than that of Cluster 1 but lower than that of
Cluster 2. They worked with moderate persistence on 34% of the
tasks and with high persistence on 18% of the tasks. Compared to
the other two clusters, this cluster does not distinctively stand
out in any type of behavior. The students gave up a medium portion
of topics and worked with high effort on a comparatively low
portion of topics. They worked on the tasks with mostly moderate
persistence. It seems they were regulating their learning in a
rational way in the self-regulated learning environment.


4.2 ANALYSIS OF COVARIANCE (ANCOVA)


In order to investigate the association between persistence and
academic performance, a one-way analysis of covariance
(ANCOVA) was conducted to determine whether there was a
statistically significant difference between the three clusters on
posttest scores, controlling for pretest scores. The effect of
cluster on posttest scores after controlling for pretest scores
was not statistically significant, F(2,212) = 1.25, p = .29, which
means the academic achievement of the three clusters with
different behavior patterns was not significantly different.


4.3 ANALYSIS OF VARIANCE (ANOVA) AND POST HOC TESTS


In order to understand why persistence was not related to
academic achievement, we further examined the percentage of
topics attempted with moderate persistence and high persistence.
For clusters one, two and three, the percentages of tasks attempted
with moderate persistence without mastery were 0.11 (σ = 0.05),
0.08 (σ = 0.04) and 0.07 (σ = 0.03), respectively. The percentages
of tasks attempted with high persistence without mastery were
0.21 (σ = 0.08), 0.17 (σ = 0.06) and 0.16 (σ = 0.06). Analysis of
variance (ANOVA) indicated a significant difference among the
three clusters in the unmastered topics attempted with moderate
persistence (F(2,272) = 30.3, p < .001) and with high persistence
(F(2,272) = 14.3, p < .001). Post-hoc tests indicated Cluster 1 was
significantly higher than both Cluster 2 and Cluster 3 in
unmastered topics with both moderate and high persistence. This
provides some insight as to why persistence did not make a
difference in learning: the students were wheel-spinning [5]. We
explored two highly attempted topics in our data sets and found
the probability of mastering those topics got close to zero after
a certain number of attempts (as shown in Fig. 4). This indicates
the existence of wheel-spinning.
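
The F statistics reported above come from one-way ANOVAs. The sketch below shows the standard computation (between-group mean square divided by within-group mean square) and the resulting degrees of freedom; the function name and the toy numbers are hypothetical and are not the study's data.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy example with three groups of three observations each.
f_stat, df_b, df_w = one_way_anova(
    [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(df_b, df_w)  # 2 6
```

With three clusters of 103, 54 and 118 students, the same formulas give 2 and 275 - 3 = 272 degrees of freedom, which agrees with the F(2,272) values reported in this section.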


Another one-way analysis of variance (ANOVA) was conducted
to determine whether there was a statistically significant
difference between the three clusters in the number of mastered
topics at different difficulty levels. The topics were divided
into three levels based on the percentage of students who mastered
them. The topics in the first quartile had the highest mastery
percentage and were defined as easy topics. The topics in the
second and third quartiles had medium mastery percentages and were
defined as medium topics. The topics in the fourth quartile had
the lowest mastery percentage and were defined as hard topics. The
numbers of mastered easy topics were not found to be significantly
different among the three clusters, F(2,272) = 2.56, p = .08.
However, the numbers of mastered medium (F(2,272) = 9.98,
p < .001) and hard topics (F(2,251) = 8.92, p < .001) were found to
be significantly different between clusters. Post-hoc tests
indicated that clusters one and three mastered significantly more
medium and hard topics than cluster two, but there was no
statistically significant difference between clusters one and
three. The means and standard deviations of the number of topics
mastered by each cluster are shown in Table 2.


Table 2: Means and standard deviations of the number of
topics mastered by each cluster

Topic level     Cluster 1         Cluster 2         Cluster 3
Easy topics     24.06 (σ=12.1)    22.7 (σ=12.88)    27.21 (σ=15.08)
Medium topics   49.46 (σ=26.85)   33.56 (σ=18.74)   51.75 (σ=26.99)
Hard topics     16.11 (σ=13.93)   6.86 (σ=6.05)     14.27 (σ=12.13)


5. DISCUSSION AND CONCLUSION 


In previous research, student persistence has typically been
measured with macro-level data (e.g., completion of an entire
course). This study took a different approach by examining
persistence at a more micro-level; specifically, we looked at
student persistence within specific tasks in the ALEKS learning
system. We were able to extract three distinct clusters of
persistence-related student behaviors through cluster analysis.
The students in the high persistence cluster put medium to high
effort into most of the topics they attempted, and they rarely
switched to a new topic before mastery. The students in the
moderate persistence cluster put medium effort into most topics
they attempted and they did not easily give up topics before
mastery. The students in the low persistence cluster frequently
switched to new topics before mastery, often giving up tasks after
one or two attempts. The comparison of students’ academic
achievement in the three clusters did not reveal any significant
difference. This result is consistent with the hypothesis proposed
by Stekel and Tobias [23], who suggested that persistence and
achievement are unrelated within individualized learning contexts.
Although learning gains on the standardized tests did not differ
between clusters, the mastery of topics was found to be different.
The more persistent clusters (cluster one and cluster three)
mastered more medium and hard topics than the non-persistent
cluster (cluster two). This suggests persistence was associated
with learning in ALEKS, especially for more difficult topics. The
inconsistency between learning gains in ALEKS and on the TCAPs
might be related to differences in the topics covered by ALEKS and
the TCAPs.


It is worth noting that the pretest and posttest assessments present
a limitation to the current analysis. The TCAP5 and TCAP6 were
used as pretest and posttest measures, and may cover different
concepts that are not well aligned. However, a further look at the
possible reasons behind non-productive persistence suggested
wheel-spinning might relate to ineffective learning. That is, even
though students were persistently working on a single topic, they
appeared to be at an impasse. These impasses were not resolved
with more attempts, which ultimately resulted in the student never
mastering the topic. Although ALEKS has a system that can
detect ineffective learning and provide feedback, like “Failed”, to
learners, the percentage of “Failed” results was very low (i.e.,
1%). In many cases, learners were struggling and wheel-spinning,
but the system did not stop them with a “Failed” indicator or any
other type of intervention. Therefore, we suggest that ALEKS
improve its mechanism for detecting wheel-spinning and provide
intervention in a timely manner.
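
One simple way to look for the wheel-spinning pattern described above is to ask, for each attempt count n, what share of the student-topic episodes that were still going after n attempts ended in mastery. The sketch below is a hypothetical illustration of that idea; the function name, the episode format `(total_attempts, mastered)` and the toy data are assumptions, and it is not the analysis behind Fig. 4.

```python
def mastery_rate_after(episodes, max_n):
    """Eventual mastery rate among episodes lasting more than n attempts.

    episodes: list of (total_attempts, mastered) per student-topic pair.
    Returns {n: rate or None} for n = 1..max_n; None means no episode
    lasted longer than n attempts.
    """
    rates = {}
    for n in range(1, max_n + 1):
        # Episodes still unresolved after n attempts.
        pool = [mastered for total, mastered in episodes if total > n]
        rates[n] = sum(pool) / len(pool) if pool else None
    return rates


# Toy data: two topics mastered early, two abandoned after many attempts.
episodes = [(3, True), (5, True), (12, False), (15, False)]
rates = mastery_rate_after(episodes, 15)
print(rates[2], rates[5], rates[15])  # 0.5 0.0 None
```

A rate that falls toward zero as n grows is the signature a detector could use to trigger an intervention before a learner spends many more unproductive attempts on a topic.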
ABSTRACT


Social relationships, such as interpersonal closeness or rap- 
port, can lead to improved student learning, but such dy- 
namic, interpersonal phenomena can be difficult for edu- 
cational support technologies to detect. In this paper, we 
describe an approach for rapport detection in peer tutor- 
ing, using temporal association rules learned from nonver- 
bal, social, and on-task verbal behaviors. From a corpus of 
60 hours of annotated multimodal peer tutoring data, we 
learn the temporal association between behaviors and the 
rapport score for each 30-second “thin-slice”. We then train 
a stacked ensemble classification model on those association 
rules and evaluate our ability to reliably predict rapport us- 
ing multimodal behavioral data. We find that our approach 
allows us to predict rapport well above chance, and more 
accurately than two baseline models. We are able to predict 
high rapport more accurately for strangers and low rapport 
more accurately for friends, which we believe holds promise 
for the integration of rapport detection into collaborative 
learning supports and intelligent tutoring systems. 


Keywords 


rapport, association rule mining, peer tutoring, social states 


1. INTRODUCTION 


Social relationships, such as the long-term closeness of friends 
or the short-term rapport built while getting to know some- 
one, have been shown to result in benefits for student learn- 
ing, such as increased help-seeking, productive cognitive con- 
flict, and elaborated reasoning [2]. In collaborative learning 
settings, higher interpersonal rapport between students is 
associated with productive educational processes such as 
instances of transactive reasoning [13] and greater learn- 
ing gains over time [18]. Educational technologies, such as 
intelligent tutoring systems (ITS) and pedagogical agents,
increasingly attempt to reap the benefits of interpersonal 
closeness and rapport between humans and agents to im- 
prove engagement, motivation, or trust in the pedagogical 
agent [19]. However, before educational technologies can re- 
spond appropriately to the rapport between collaborating 
students, or build rapport between students and a pedagog- 
ical agent, they must first model that rapport as it changes 
over time, given the available behavioral data. The educa- 
tional data mining community has developed, over the last 
several decades, detectors of individual student phenomena 
such as frustration, boredom, engagement, carelessness, and 
many others [3,17], but it has developed relatively fewer 
methods for modeling inter-personal social phenomena such 
as the rapport between members of a collaborative group or 
peer tutoring dyad. 


This paper is intended to contribute to the detection of inter- 
personal social states, such as rapport, through nonverbal, 
task (verbal) and social (verbal) channels, captured through 
audio and video input. In this paper, we describe a process 
for using temporal association rule mining to learn patterns 
of behaviors from an annotated corpus of nearly 60 hours of 
dyadic peer tutoring interactions. We then use those tempo- 
ral association rules to predict the “thin-slice” dyadic rapport 
level for every 30-second time-slice, via a stacked ensemble 
model. We find that temporal rules generated from anno- 
tations of students’ nonverbal, on-task, and off-task social 
behaviors were overall able to predict rapport at levels well 
above chance, and at nearly double the prediction perfor- 
mance (AUC) of a baseline approach. We found that this 
approach allows us to predict high rapport significantly bet- 
ter than low rapport overall, while predicting high rapport 
for strangers more accurately than for friends. 


This paper contributes to the Educational Data Mining (EDM) 
community in several ways: (1) We describe a process for 
automatically learning temporal association rules 
from annotations of nonverbal, and social and on-task ver- 
bal behaviors, and using those rules to predict rapport in 
a stacked ensemble model, compared to two baseline ap- 
proaches. (2) We describe the variation in the number of 
high-confidence rules learned for each of the behavioral chan- 
nels, to inform future developers of rapport detectors of the 
data sources that may be most fruitful to capture. (3) We 
evaluate the predictive performance of those temporal rules 
in predicting rapport for both friends and strangers, thereby 
addressing both short- and long-term rapport. 


2. RELATED WORK 


In order to choose the behaviors used to predict rapport, we 
draw on a framework of rapport-building proposed by [22]. 
In this theoretical model, rapport is a dyadic phenomenon, 
co-constructed over time by both members of the dyad. Ac- 
cording to [22], rapport is developed through nonverbal be- 
haviors and verbal social conversational strategies that serve 
various social functions and sub-goals in rapport develop- 
ment, such as face management, mutual attentiveness, and 
coordination [22]. Our work extends [22]’s approach by also 
incorporating the task-related verbal strategies from both 
tutor and tutee, such as feedback, instructions, and task-related 
questions, which are essential for the tutoring process 
and which we hypothesize will impact, and be impacted by, 
the rapport between members of a peer tutoring dyad [9]. 


Prior researchers in discourse analysis, multi-modal inter- 
action, and dialogue systems have developed detectors for 
various aspects of interpersonal relationship development, 
such as Yu et al.’s friendship prediction for peer tutoring 
dyads, which found that dyadic features such as mutual 
gaze and smile behaviors were predictive of friendship [21]. 
In prior EDM work, some [15] have used the temporal co- 
occurrence of nonverbal behaviors (operationalized as Facial 
Action Units) to capture “behavioral synchronicity” in col- 
laborative problem-solving dyads. Others have developed 
automatic classifiers of on-task-related interpersonal behav- 
iors, such as [14]’s method for classifying socio-cognitive con- 
flict in collaborative learning within an intelligent tutoring 
system. Others, such as [20], have developed automatic 
classifiers of dyadic impoliteness and positivity, work that 
we build on here with the social conversational strategies 
we incorporate into the association rules. Prior work has 
demonstrated the effectiveness of out-of-domain social talk 
in pedagogical agents, such as [8]’s social pedagogical agent 
used in collaborative learning. 


2.1 Temporal Patterns in Behavior 

As rapport-building is a dynamic phenomenon, it is impacted 
by the contingent patterns of verbal and nonverbal behavior. 
Ohlssen et al. describe how popular methods for discourse 
analysis that use a “code-and-count” method [12] collapse 
the temporal dimension and are thus unable to understand 
the rich patterns of interaction likely to impact learning, 
or rapport. To address this gap, we draw on the Temporal 
Interval Tree Association Rule Learning (Titarl) framework 
[7] to discover temporal patterns of verbal and nonverbal 
behavior and their association with the dyadic rapport 
between members of a tutoring dyad for every 30-second 
time slice. The Titarl framework has been previously used 
to analyze medical patients’ vital sign data [7], and in our 
lab, [24] have used Titarl to identify patterns of social con- 
versational strategies and nonverbal behaviors predictive of 
levels of rapport. Crucially, however, [24] did not include 
the tutoring and learning behaviors that are the heart of 
the task component of the peer tutoring interactions, and 
which are likely to impact rapport through, for example, 
the face-threatening nature of providing feedback or instruc- 
tions [9]. Therefore, in order to more effectively predict the 
rapport between members of a peer tutoring dyad, we in- 
clude rules learned from the nonverbal, social verbal, and 
tutoring-related verbal behaviors. 


3. METHODS 


3.1 Research Questions 

RQ1: Can temporal association rules learned from social 
conversational strategies, task, and nonverbal behaviors in 
peer tutoring be used to predict rapport at levels above 
chance? From [7] and [24], we believe that they can, and 
that we can improve the predictive performance by adding 
the task-related verbal behaviors. 


RQ2: Is a classifier trained on temporal association rules 
better able to predict rapport (a) for some relationship types 
than others or (b) at some levels than others? Following [24], 
we believe we will be better able to predict high rapport 
among strangers than among friends. 


RQ3: Are temporal association rules (TAR) generated using 
all three channels of on-task (verbal), social (verbal), 
and nonverbal behaviors better able to predict rapport than 
rules generated from any one or two of those behavior types? 
From [9, 21, 24], we believe that including task, social, and 
nonverbal together will perform the best. 


3.2 Data Collection and Dialogue Corpus 

The dialogue corpus described here was collected as part 
of a larger study on the effects of rapport-building on reciprocal 
peer tutoring [9, 10, 18, 22]. The participants were 
assigned to 12 dyads that alternated tutoring each other in 
Algebra for 5 weekly hour-long sessions, for a total of 60 
hours. Half were male and half were female, assigned to 
same-gender dyads. To investigate how the impact of vari- 
ous task, social, and nonverbal behaviors on rapport differs 
between dyads with varying degrees of interpersonal closeness, 
we used friendship as a proxy for long-term rapport. 
We asked half of the participants to bring a same-age, 
same-gender friend to the sessions with them, and we paired 
the remaining participants with a stranger, 
using the 5 weeks to capture short-term rapport-building. 
Audio and video data were recorded, transcribed, and seg- 
mented for clause-level dialogue annotation. 


3.3 Thin-Slice Rapport Ratings 


The rapport between the participants was evaluated using 
a “thin-slice” approach [1]. First, the corpus was divided into 
30-second video slices, then shuffled (so the raters did not 
inadvertently rate the change in rapport), and provided to 
naive, third-party raters. Three such raters rated the rapport 
present in each slice on a Likert scale from 1-7, from 
lowest possible rapport to highest possible rapport. A sin- 
gle rating was then chosen for each slice using an inverse 
bias-corrected weighted majority vote approach, described 
in [18], to account for potential over-use or under-use of certain 
labels by the raters. The final consensus measure of 
inter-rater reliability, Cronbach’s α, was .86, justifying 
the use of this rating selection method [18]. This rating was 
used when learning the associations between the task, social, 
and nonverbal behaviors and the rapport level. 


Table 1: Annotation Types, Labels, Definitions, and Examples 

Type   | Label                      | Definition                                                            | Example 
Task   | Knowledge-telling          | Stating procedures or the answer                                      | Divide it by 9. 
Social | Self-Disclosure            | Sharing personal information about oneself                            | I suck at negative numbers. 
Social | Refer to Shared Experience | Discussing an experience they had together                            | Remember that soccer game? 
Social | Violation of Social Norms  | Statements that break social conventions                              | It’s a zero, dummy. 
Social | Praise                     | Positive acknowledgment of the other                                  | You’re so smart! 
Social | Reciprocation              | Responding to a conversational move with the same conversational move | Tutor self-discloses, then the tutee self-discloses 


3.4 Dialogue Annotation 

To investigate the impact of rapport-building verbal (so- 
cial and task) and nonverbal behaviors, we annotated our 
dataset for 3 types of nonverbal behaviors, 5 types of so- 
cial conversational strategies, and 5 types of tutoring and 
learning behaviors, as shown in Table 1, all annotated with 
> .7 Krippendorff’s α. The nonverbal behaviors annotated 
were head nods, smiles, and shifts in eye gaze from the partner, 
to the Algebra worksheets, to anywhere else, similar 
to [21]. The social verbal behaviors were chosen according 
to [22]’s theory of rapport-building, and include behaviors 
such as self-disclosure, reference to shared experiences, 
violation of social norms, and others. The on-task verbal 
behaviors annotated are based in part on [16]’s work on 
knowledge-telling and knowledge-building, as well as [6]’s 
work with procedural and conceptual questions, described in 
more detail in [10]. 


3.5 Temporal Association Rule Mining 

To investigate the impact that these nonverbal, task, and so- 
cial behaviors had on rapport at a 30-second thin-slice level, 
we adopted a temporal association rule mining approach, 
following [23]. The framework we use, the “Temporal Inter- 
val Tree Association Rule Learning” (Titarl) algorithm [7], 
allows us to identify temporal patterns of behaviors within 
each time slice that are probabilistically associated with the 
value of rapport for that slice. For each 30-second time win- 
dow, a rule is learned much like the generic rule below. 


“If event A happens at time t, there is 50% chance of event 
B happening between time t+3 to t+5”. 


Our data is comprised of both multivariate symbolic time 
sequences (the nonverbal, task, and social behaviors) and 
multivariate scalar time series (the rapport value for each 
slice). The Titarl algorithm learns a large set of rules 
on a subset of our data (the training set), filters those rules 
based on a set of parameter thresholds, and fuses similar simple 
rules into more complex rules, which we then use in predicting 
rapport on a held-out test set. Because we believed 
that the ways that friends and strangers build rapport with 
each other over 5 weeks are likely to differ following [23], we 
ran the Titarl algorithm on sets of friend dyads and sets of 
stranger dyads separately. 
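
To make the notions of rule confidence and support concrete, the following is a minimal Python sketch, not the Titarl implementation itself. It scores a single hand-written rule of the form "A at time t implies B within a window after t" on toy event times. The function name `rule_stats`, the window bounds, and the event times are all illustrative assumptions.

```python
from bisect import bisect_left, bisect_right


def rule_stats(a_times, b_times, lo, hi):
    """Score the rule 'A at t => B in [t+lo, t+hi]' on two event-time lists.

    Returns (confidence, support):
      confidence = share of A events followed by a B inside the window
      support    = share of B events that fall inside some A's window
    """
    a_sorted = sorted(a_times)
    b_sorted = sorted(b_times)

    def count_between(sorted_times, start, end):
        # Number of events with start <= time <= end.
        return bisect_right(sorted_times, end) - bisect_left(sorted_times, start)

    fired = sum(1 for t in a_sorted if count_between(b_sorted, t + lo, t + hi) > 0)
    explained = sum(1 for t in b_sorted if count_between(a_sorted, t - hi, t - lo) > 0)

    confidence = fired / len(a_sorted) if a_sorted else 0.0
    support = explained / len(b_sorted) if b_sorted else 0.0
    return confidence, support


# Toy data (hypothetical): times in seconds within one session.
a_events = [0, 10, 20, 30]
b_events = [4, 13, 40]
confidence, support = rule_stats(a_events, b_events, lo=3, hi=5)
```

A rule learner in this style would keep only rules whose confidence, support, and number of uses exceed chosen thresholds.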


3.6 Rapport Detection Process 

Figure 1: Multi-step process for prediction of rapport 
using temporal association rule mining and a 
stacked classifier ensemble. 


We describe here an approach laid out in Figure 1. We 
first divided our 6 friend dyads and 6 stranger dyads, with 
5 sessions per dyad, into a training set of 4 dyads (20 to- 
tal sessions) and a held-out test set of 2 dyads (10 total 
sessions) for both relationship types. Then, (Step 1) we 
created seven combinations of the Social, Task, and Nonver- 
bal annotations described in Table 1, to identify differences 
in prediction performance for the different behavior types 
(RQ3). Next (Step 2), for each of those behavior combina- 
tions, we created a matrix M with n+1 columns, with n = 
the total number of annotation types (used by the tutor and 
the tutee), described in Table 1, with the first column in M 
being the start time, in seconds, of each behavior. Each row 
in M was an event, or the start of an annotated verbal or 
nonverbal behavior. From each matrix M, we generated an 
“event file” which included the behavior sequence as well as 
the scalar time series of the rapport value for the 30-second 
time slice within which those behaviors occurred. 
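
Steps 1 and 2 can be sketched in a few lines of Python. This is an illustrative reading of the description above, assuming one indicator column per annotation type; the label names, event times, and rapport values are hypothetical, and the actual event-file format used by Titarl is not reproduced here.

```python
from itertools import combinations

SLICE_SECONDS = 30
CHANNELS = ("Task", "Social", "Nonverbal")

# Step 1: every non-empty combination of the three behavior channels.
channel_sets = [c for r in range(1, len(CHANNELS) + 1)
                for c in combinations(CHANNELS, r)]


def build_event_matrix(events, labels):
    """Step 2: one row per event; column 0 is the start time in seconds,
    followed by one indicator column per annotation type."""
    column = {label: i for i, label in enumerate(labels)}
    rows = []
    for start, label in sorted(events):
        row = [start] + [0] * len(labels)
        row[1 + column[label]] = 1
        rows.append(row)
    return rows


def slice_index(start):
    """Index of the 30-second slice that contains a start time."""
    return int(start // SLICE_SECONDS)


# Hypothetical annotations: (start time, speaker_label).
labels = ["tutor_PR", "tutee_KT", "tutee_smile"]
events = [(12.5, "tutor_PR"), (14.0, "tutee_KT"), (31.2, "tutee_smile")]
rapport_by_slice = {0: 6, 1: 4}  # thin-slice rapport rating per slice

M = build_event_matrix(events, labels)
event_file = [(row, rapport_by_slice[slice_index(row[0])]) for row in M]
```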


Figure 2: Example temporal association rule for 
strangers with high rapport, with 100% confidence, 
9% support, and 44 uses. 


Then, using these files, we (Step 3) learned a set of association 
rules R for each training set, using the Titarl algorithm 
[7]. These rules contain a head, which is the scalar 
output value of rapport (an integer from 1-7), and a body, 
which is the ordered set of annotated behaviors used to predict 
the rapport in each slice. Prior to learning, we specified 
the minimum confidence (the probability that the rule’s 
prediction is true) at 50%, the minimum support (the percentage 
of events explained by the rule) at 5%, and the minimum 
number of uses for each rule at 10, following [24]. An 
example of a rule can be seen in Figure 2, where a Tutor’s use 
of Praise (PR), followed by the Tutee’s “Knowledge-telling” 
(KT), or self-explanation, is associated with a rapport value 
of 6 (high), with confidence of 100%, support of 9%, and 
44 uses in that model. This rule was learned from a Task 
and Social behavioral model, for a dyad of Strangers. The 
nature of these data can be further illustrated with another 
example, from a highly confident association rule learned 
from the Task and Social model, for dyads of Friends. The 
following high-confidence rule is associated with Rapport of 
1 (low): a Tutee asks a Shallow Question, receiving four 
“Knowledge-telling” utterances in a row from their Tutor, to 
which the Tutee responds with a “Social Norm Violation”. 
In other words, the tutee asks about the procedure, the tutor 
tells him what to do in multiple utterances, and the tutee 
responds with some norm violation, perhaps rudeness. To 
ensure that the rules learned from each set of dyads were not 
overfit to the particular training set of dyads used to learn 
them, we learned a rule set (i.e. repeated Steps 1-3) for all 
possible combinations of the 6 friend and 6 stranger dyads, 
resulting in 15 “folds” for friend dyads and 15 for strangers 
(i.e. choosing all possible sets of 4 dyads to use as training 
sets from the 6 total dyads). Each fold had several hundred 
association rules learned above our threshold for confidence, 
support, and usage. In Figure 3, we show the mean number 
of rules learned, showing only those with confidence, sup- 
port, and usage above the median for ease of visualization. 
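
The fold construction described above (every way of choosing 4 training dyads out of 6, leaving 2 for testing) can be enumerated directly. The sketch below uses hypothetical dyad identifiers and a hypothetical helper name.

```python
from itertools import combinations


def train_test_folds(dyads, n_train=4):
    """Return every (train, test) split with n_train dyads used for training."""
    folds = []
    for train in combinations(dyads, n_train):
        test = tuple(d for d in dyads if d not in train)
        folds.append((train, test))
    return folds


friend_dyads = ["F1", "F2", "F3", "F4", "F5", "F6"]  # hypothetical IDs
stranger_dyads = ["S1", "S2", "S3", "S4", "S5", "S6"]

friend_folds = train_test_folds(friend_dyads)
stranger_folds = train_test_folds(stranger_dyads)
```

Choosing 4 of 6 dyads gives 15 splits per relationship type, which matches the 30 folds in total.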


After learning the rules, in Step 4 we use the rules to train 
random forest classifiers to predict the rapport level for each 
30-second slice. To do this, we first generated a matrix N 
for each rule set in each of the 30 training sets, with a row 
for each rule event in that set, and n+1 columns, where 
n is the number of rules in that train set, and the final 
column was a binary indicator of the rapport value for that 
time slice. We ran 7 random forest classifiers (one for each 
rapport level) for each matrix N, for each of the 15 folds of 
friends and 15 folds of stranger training sets, giving us (in 
Step 5) a prediction probability estimate for each of the 7 
rapport values, for each event in every fold, for each of the 
7 behavioral channels (from Step 1). Finally, we wanted to 
evaluate the relative impact of those 7 behavior types, and 
so we composed different combinations of nonverbal, social, 
and task behavior. We then, in Step 6, use the prediction 
probability output by the random forest classifiers as the 
input features in training a single multi-class Support Vector 
Machine (SVM) classification model for each of the 30 folds 
to predict the overall rapport level for each time slice in that 
fold. In the following section, we discuss the performance of 
this final classification step in predicting rapport for each 
relationship type and evaluate its performance against two 
baselines from earlier steps in the process. 
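
As an illustration of Step 4, the sketch below builds one matrix N per rapport level from a list of rule events. It assumes that the first n columns are binary indicators of which rules fired for that event, which is one plausible reading of the description rather than a confirmed detail; the rule identifiers and ratings are hypothetical.

```python
def build_rule_matrix(rule_events, rule_ids, target_level):
    """One row per rule event: n rule-fired indicators plus a final binary
    column that is 1 when the slice's rapport rating equals target_level."""
    rows = []
    for fired_rules, rating in rule_events:
        features = [1 if rule in fired_rules else 0 for rule in rule_ids]
        rows.append(features + [1 if rating == target_level else 0])
    return rows


rule_ids = ["r1", "r2", "r3"]  # hypothetical rule names
rule_events = [({"r1"}, 6), ({"r2", "r3"}, 1), ({"r1", "r3"}, 6)]

# One matrix per rapport level (1-7), each feeding one binary classifier.
matrices = {level: build_rule_matrix(rule_events, rule_ids, level)
            for level in range(1, 8)}
```

Each of the seven matrices would then be used to fit one binary classifier, and the seven resulting probability estimates per event become the input features for the final multi-class model.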


Figure 3: Mean number (and standard error) of 
rules learned from 7 behavioral channels, with high 
confidence, support, and usage, for friends and 
strangers with low, neutral, and high rapport. 


4. RESULTS 


First, before investigating our first research question about 
the performance of our approach in predicting rapport, we 
wanted to inspect the total number of rules learned from 
each behavioral channel with high confidence, support, and 
usage, to better understand the extent to which the number 
of highly confident temporal rules varied for each behavioral 
type. See Figure 3 for the mean number and standard er- 
rors for rules learned above the median confidence, support, 
and usage for low, neutral, and high rapport for friends and 
strangers. Based on the distribution of slices at each level, 
we converted the 7 scalar rapport values to the low (1-3), 
neutral (4), and high (5-7) rapport levels. 
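
The conversion from the 7-point rating to three levels is a simple binning, sketched here with a hypothetical function name.

```python
def rapport_level(rating):
    """Map a 1-7 thin-slice rapport rating to low, neutral, or high."""
    if not 1 <= rating <= 7:
        raise ValueError("rapport rating must be between 1 and 7")
    if rating <= 3:
        return "low"
    if rating == 4:
        return "neutral"
    return "high"


levels = [rapport_level(r) for r in range(1, 8)]
```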


We see that friends had significantly more (t(20.8)=2.7, p=.01) 
high-confidence Social and Nonverbal (“SNV”) rules learned 
in High Rapport slices than the next largest behavioral chan- 
nel, the “TSNV” channel, combining Task, Social, and Non- 
verbal. This suggests that a method for detecting high rap- 
port between friends that uses Social and Nonverbal behav- 
iors will have many more high-confidence, frequently occur- 
ring rules with which to predict rapport than using other 
sets of behavior types. Conversely, for rules learned from 
Friend dyads for Low Rapport slices, there is a significantly 
(t(26.6)=2.6, p<.05) greater number of high-confidence, fre- 
quently occurring Task (“T”) rules than rules learned from 
the Social and Nonverbal (SNV) behaviors. That is, there 
are substantially more high-confidence, high-support, and 
frequently occurring ways in which Friends displayed Low 
Rapport through their on-task behavior (and on-task com- 
bined with nonverbal, “TNV”) than through other avail- 
able channels. This suggests that a method for detecting 
students’ low rapport, for a dyad of friends, may benefit 
from incorporating the task-related behaviors such as in- 
structions, explanations, questions, and provision of feed- 
back in addition to purely social behaviors, as in [23]. Sim- 
ilarly, for Strangers, their Task and Social (“TS”) channel 
had the largest number of rules learned associated with High 
Rapport slices, significantly more than the “SNV” behaviors 
(t(26.8)=1.2, p<.05), though not significantly more than the 
TSNV behaviors. This suggests that a detector of high rap- 
port that leverages Task and Social behaviors may have more 
high-confidence association rules from which to draw for its 
classification of high rapport for students without a prior 
friendship relationship (i.e. “strangers”) than one relying 
solely on Strangers’ social and nonverbal behaviors. 


Table 2: Average PR-AUC (and standard deviation) 
of 3 rapport prediction models 

Model        | PR-AUC (SD) 
IF Baseline  | .42 (.07) 
RF Baseline  | .33 (.03) 
TAR Ensemble | .60 (.08) 


Then, to evaluate the overall performance of our approach 
in predicting low, neutral, and high rapport, we used the 
prediction probability from the 7 binary random forest clas- 
sifiers (from Step 5) as the input into a 3-way one-vs-rest 
SVM classifier, for every behavioral channel model (Step 
6). We first ran a 10-fold cross-validated grid search on our 
training set to discover the optimal set of parameters to use 
for the SVM model, using an RBF kernel, with C=10 and 
γ = 1. From the SVM, we use the average area under the 
Precision-Recall curve (PR-AUC) for each of the 7 behav- 
ioral models as our performance measure, following [5]. 
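
For readers unfamiliar with the measure, the sketch below estimates the area under the precision-recall curve with average precision, one common estimator, and averages it over one-vs-rest classes. It is an assumption that this matches the exact estimator used here; the function names and toy scores are illustrative.

```python
def average_precision(y_true, scores):
    """Average precision for binary labels (1 = positive) and real scores."""
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    # Rank examples from the highest score to the lowest.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this recall point
    return total / n_pos


def macro_average_precision(labels, scores_by_class):
    """Mean one-vs-rest average precision over all classes."""
    values = []
    for cls, scores in scores_by_class.items():
        binary = [1 if label == cls else 0 for label in labels]
        values.append(average_precision(binary, scores))
    return sum(values) / len(values)


ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
macro = macro_average_precision(
    ["low", "high"],
    {"low": [0.9, 0.1], "high": [0.2, 0.8]},
)
```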


First, for RQ1, to validate the appropriateness of our stacked 
ensemble approach (“TAR Ensemble”), we compare its pre- 
diction performance to two baseline approaches. We com- 
pare first to a baseline that treats the annotated behaviors 
in each slice as independent features in an SVM using the 
same parameters (“IF Baseline”). The TAR Ensemble signif- 
icantly (t(413) = 24.4, p<.001) outperforms the IF Baseline 
with a mean AUC of .60 (sd = .08) for the TAR Ensembles, 
compared to a mean AUC of .42 (sd = .07) for the 
IF Baseline. We then compare the TAR Ensemble to an- 
other baseline (“RF Baseline”) that simply takes the largest 
prediction probability from the 7 random forests (Step 5 in 
Figure 1) as the predicted class value, using random selection 
for ties. The TAR Ensemble significantly (t(256) = 46, 
p<.001) outperforms the RF Baseline by nearly 2 to 1, with 
a mean AUC of .60 (sd = .08) for our approach and a mean 
AUC of .33 (sd = .03) for the RF Baseline. See Table 2 for 
a summary of the PR-AUC values for each model. 
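
The RF Baseline decision rule, taking the rapport value with the largest predicted probability and breaking ties at random, can be sketched as follows. The probabilities are made up, and the helper name is hypothetical.

```python
import random


def argmax_with_random_ties(probabilities, rng):
    """Return the rapport value with the highest probability; if several
    values share the maximum, pick one of them uniformly at random."""
    best = max(probabilities.values())
    tied = sorted(level for level, p in probabilities.items() if p == best)
    return rng.choice(tied)


rng = random.Random(0)  # seeded only to make the sketch repeatable

# Hypothetical outputs of the 7 binary random forests for one event.
probs = {1: 0.05, 2: 0.10, 3: 0.10, 4: 0.30, 5: 0.20, 6: 0.70, 7: 0.15}
predicted = argmax_with_random_ties(probs, rng)

tied_probs = {1: 0.4, 2: 0.4, 3: 0.1}
tied_prediction = argmax_with_random_ties(tied_probs, rng)
```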


For RQ2b (rapport levels), we find that the Stacked Ensemble is 
better able to predict High Rapport than Low (t(417)=5.9, p<.005). 
For RQ2a (relationship types), we are better able to predict Low 
Rapport for Friends than Strangers (t(197) = 5.8, p<.001). Conversely, 
we are better able to predict the rapport among Strangers 
than among Friends for both Neutral (t(206.5) = 5.5, p<.001) 
and High rapport levels (t(207) = 2.7, p<.01). For RQ3, 
no single set of behavioral channels significantly outper- 
formed the others, in an ANOVA of the PR-AUC measure 
with each relationship type (Friend/Stranger), rapport level 
(Low/Neutral/High), and behavioral type (TS, TSNV, etc). 


5. DISCUSSION AND CONCLUSION 


Interpersonal social dynamics provide the grounding for learning 
interactions, whether students are learning collaboratively, 
in peer tutoring, or working with their classroom 
teacher or even a virtual agent. However, technological sup- 
ports for learning often focus on detecting and modeling 
individual, intra-personal states such as students’ affect or 
engagement, without considering the latent social state 
underpinning their interactions with others. In this work, we 
present one method for detecting the latent social state of 
interpersonal rapport in learning interactions, using a tem- 
poral association rule mining approach to learn patterns 
of nonverbal and verbal (social and task) behaviors, as in- 
put in predicting the rapport level in a stacked ensemble 
model. Our ensemble approach outperforms two baselines, 
(1) the independent behaviors as features, and (2) the ran- 
dom forests trained on the temporal association rules. 


We find that, overall, our approach is better able to predict 
high rapport than low rapport, and it predicts high and neu- 
tral rapport more accurately for Strangers than for Friends, 
while predicting low rapport more accurately for Friends 
than for Strangers. This is good news for designers of vir- 
tual agents that want to detect and build rapport with a new 
student, or designers of computer-supported collaborative 
learning technologies that want to detect rapport in learning. 
However, contrary to our expectations (for RQ3), we saw no 
significant difference in prediction performance across the 
models generated from different combinations of behavior 
types (e.g. SNV, TSNV, etc). We did see a significant dif- 
ference in the total number of association rules learned from 
those behavior types, however, suggesting that rapport de- 
tectors will be better able to predict rapport if they use the 
behavior types that occur more frequently in learning. For 
instance, a rapport detection method for strangers that in- 
corporates Task and Social behavior will have significantly 
more high-confidence, high-support association rules with 
which to detect the rapport between them. 


One of the limitations of this current approach is that, while 
it may reach quite good levels of performance in detecting 
rapport, the large number of rules learned makes it difficult 
to identify the specific rules that are most predictive of rap- 
port, in addition to concerns about dimensionality. This 
work is limited by the small sample size, and by being re- 
stricted to same-gender dyads; using a larger set of dyads 
to conduct these analyses may reveal differences in predic- 
tion performance for different behavioral types (social, task, 
nonverbal), if they exist. We have currently finished collect- 
ing 22 dyads’ worth of interactions among strangers (over 
40 hours), and we will be conducting a similar set of anno- 
tations and analyses on them. In this paper, the thin-slice 
rapport ratings and annotations were hand-annotated from 
a corpus of audio/video data, limiting the automaticity of 
this approach. However, we are in the process of moving 
to a crowd-sourced method for obtaining the ground truth 
rapport ratings for each 30-second slice. Preliminary experiments 
for crowd-sourcing the thin-slice rapport annotation 
using Amazon Mechanical Turk have yielded a Krippendorff’s 
α of 0.69 across 3 raters for each thin-slice. 


In future work, we intend to use this rapport estimation 
method for a rapport-building virtual agent in an intelligent 
tutoring system. We have developed automatic classifiers for 
the three types of nonverbal behaviors described here, using 
the OpenFace system [4], and social conversational strategy 
classifiers, such as those described by [23], classifiers which 
have already been integrated into a “socially aware robot 
assistant” (SARA), as described by [11]. Our next step is 
to develop a task-related classifier, perhaps similar to that 
used in [14], to recognize students’ task-related utterances as 
part of the rapport estimation and reasoning about natural 
language response generation. We believe that this paper 
contributes to the larger goal of educational data mining 
by demonstrating one approach to using multimodal data 
to model latent social phenomena important to learning, in 
this case the interpersonal rapport in peer tutoring. 


ABSTRACT 


Modeling student knowledge while students are acquiring 
new concepts is a crucial stepping stone towards provid- 
ing personalized automated feedback at scale. We believe 
that rich information about a student’s learning is captured 
within her responses to open-ended problems with unbounded 
solution spaces, such as programming exercises. In addition, 
sequential snapshots of a student’s progress while she 
is solving a single exercise can provide valuable insights into 
her learning behavior. Creating representations for a stu- 
dent’s knowledge state is a challenging task, but with re- 
cent advances in machine learning, there are more promis- 
ing techniques to learn representations for complex entities. 
In our work, we feed the embedded program submission se- 
quence into a recurrent neural network and train it on two 
tasks of predicting the student’s future performance. By 
training on these tasks, the model learns nuanced represen- 
tations of a student’s knowledge, exposes patterns about a 
student’s learning behavior, and reliably predicts future stu- 
dent performance. Even more importantly, the model dif- 
ferentiates within a pool of poorly performing students and 
picks out students who have true knowledge gaps, giving 
teachers early warnings to provide assistance. 


Keywords 

Educational data mining; Online education; Personalized 
learning; Knowledge tracing; Machine learning; Represen- 
tation learning; Sequential modeling. 


1. INTRODUCTION 


With the inception of online learning platforms, educators 
around the world can reach millions of students by dissem- 
inating course content through virtual classrooms. How- 
ever, in these online environments, teachers’ ability to ob- 
serve students is lost. Understanding a student’s incremen- 
tal progress is invaluable. For instance, if a teacher watches 
a student struggle with an exercise, they see the student’s 
strengths as well as their knowledge gaps. The process by 
which the student reaches the final solution is as important 
as the solution itself. We attempt to encode these 
markers of progress. We performed representation learn- 
ing with recurrent neural networks to understand a stu- 
dent’s learning trajectory as they solve open-ended program- 
ming exercises from the Hour of Code course, a MOOC on 
Code.org. The deep learning model trains on a student’s 
history of past code submissions and predicts the student’s 
future performance on the current or the next exercise. The 
model is able to learn meaningful feature representations for 
a student’s series of submissions and hence does not require 
manual feature selection, which would be very difficult for 
open-ended exercises. Furthermore, the learned representa- 
tions can be used for other related tasks, such as predicting 
an intervention. 


1.1 Motivation: Instructional Scaffolding 

The widely used pedagogical concept of the zone of proximal 
development (ZPD) suggests that ideal learning objectives 
are in a sweet spot of difficulty called the ZPD: more difficult 
than what the student can accomplish on their own, but 
not so difficult that they cannot succeed even with guidance 
[3, 24]. The guidance for accomplishing such challenging- 
but-achievable objectives is called instructional scaffolding, 
and it is most effective when personalized to each student’s 
mastery of the material [18]. 


Scaffolding is particularly difficult in MOOCs: it is hard to 
personalize instruction to thousands of students at once. 
While some research has explored the merits of academic 
habit scaffolding [6] or reciprocal scaffolding with peer col- 
laboration [19] in MOOCs, the most promising work lies 
in expert scaffolding, which involves an expert, usually a 
teacher, in the relevant domain of knowledge providing guidance 
to help students acquire knowledge [?]. Effective teachers 
possess pedagogical content knowledge (PCK), or expertise 
about not only the domain of knowledge, but also how to 
best teach that material to learners [21]. Most importantly, 
PCK helps anticipate where students will struggle. 


In existing MOOC research, the expert scaffolding usually 
takes the form of feedback to students’ responses on assignments. 
Yet, many current systems for automating feedback 
in MOOCs rely on time-consuming and potentially arbitrary 
tasks of feature engineering [20] or defining rulesets 
[22] applicable only to single exercises. This manual encoding 
of PCK is task-specific and not a generalizable unsupervised 
process. A more generalizable signal of students 
failing learning objectives is student attrition from MOOCs. 
Limited work exists exploring correlations between attrition 
and student engagement with MOOC materials [27] or other 
students (e.g. on discussion forums) [17, 26]. However, to 
the authors’ knowledge, existing MOOC attrition research 
does not control for student achievement. Often, attrition is 
merely a downstream symptom of struggling with learning 
objectives. When students underachieve, their self-concept 
of themselves as learners may be threatened, which recur- 
sively reinforces lower achievement and disengagement [4]. 
In general, anticipating common domain-specific mistakes 
with PCK can help preempt them and mitigate subsequent 
disengagement, and thus the unsupervised anticipation of 
student mistakes is a worthwhile objective for automated 
systems that can ultimately improve learning. 


2. RELATED WORK 


Representation Learning with Neural Networks 
In the field of machine learning, representation learning is 
the task of learning a model to create meaningful represen- 
tations from low-level raw data inputs. The goal of repre- 
sentation learning is to reduce the amount of human input 
and expert knowledge needed to preprocess data before feed- 
ing it into machine learning algorithms [1]. In contrast to 
manually selecting high-level features, representation learn- 
ing algorithms are trained to extract features directly from 
raw input, e.g. from words in a document. The combina- 
tion of linear functions and nonlinearities stacked in layers 
allows deep neural networks (DNNs) to learn abstract rep- 
resentations in an efficient manner [1]. Empirically, DNNs 
do particularly well when the data has high semantic com- 
plexity and manually choosing features is not only tedious, 
but often insufficient. Once the representations are trained 
on one task, they can be used for other related tasks as well. 
For example, in word2vec [12], word representations were trained on 
predicting context words but were then used for document 
classification and translation. Recurrent neural networks (RNNs) are a 
subtype of neural networks which take inputs over multiple 
timesteps and are therefore well-suited for learning repre- 
sentations on sequential data with temporal relationships. 


2.1 Program Code Embeddings 

To extend deep knowledge tracing (DKT) to understand students as they pro- 
duce rich responses over time within an exercise, a necessary 
step is to create meaningful embeddings of their program 
submissions. Piech et al. proposed to use recursive neural 
networks to create program embeddings for student code [15]. 
Recursive neural networks that learn embeddings on syntax 
trees were first developed by the NLP community to vector- 
ize sentence parse trees [23], but are even more applicable 
to computer programs due to their inherent tree structure, 


since any program can be represented as an Abstract Syntax 
Tree (AST). 


2.2 Knowledge Tracing (KT) and Deep KT 


The task of knowledge tracing can be formalized as: given 
observations of interactions $x_0, \ldots, x_t$ taken by a student on a 
particular learning task, predict aspects of their next interaction 
$x_{t+1}$ [5]. Piech et al. applied RNNs to data from Khan 
Academy’s online courses to perform knowledge tracing by 
predicting student performance [14]. The authors found 
that RNNs can robustly predict whether or not a student 

Figure 1: Exercise 18 in the Code.org Hour of Code. Left, the 
programming challenge. Right, the solution. The challenge is to 
program the squirrel to reach the acorn while using as few coding 
blocks as possible. https://studio.code.org/hoc/18 


will solve a particular problem correctly given their perfor- 
mance on prior problems. Other models that are designed 
to take low dimensional inputs, such as IRT and modifica- 
tions of Bayesian Knowledge Tracing [28] [13], sometimes 
outperform the initial version of Deep Knowledge Tracing 
(DKT) [25] [10]. However, DKT does not require student 
interactions to be manually labeled with relevant concepts 
and the RNN paradigm was designed to take vectorized in- 
puts, hence it can utilize inputs that extend beyond the 
discrete inputs of traditional models [7]. These properties 
make the model an appropriate fit to understand trajecto- 
ries of open-ended student responses, which have unbounded 
input spaces. 


A limitation of the work of Piech et al. is that it does not 
fully leverage the promise of using neural networks to trace 
knowledge. The dataset they used only contained binary 
information about a student’s final answer (i.e. correct or 
incorrect). In contrast, the Hour of Code dataset comprises 
program submissions that each have a boundless solution 
space. These infinite variations represent richly structured 
data which we can encode as program embeddings. The 
ideas presented in this paper work towards a model with 
the representational capacity to tackle open-ended knowl- 
edge tracing [9]. In addition, previous work in deep knowl- 
edge tracing has looked at student responses over multiple 
exercises, but not within an exercise. Our method focuses 
on a student’s sequence of submissions within a single pro- 
gramming exercise to predict future achievement. We model 
student learning and progress by capturing representations 
of the current state of a student’s knowledge as they work 
through the exercise and incrementally submit programs. 
When focusing exclusively on the final submission, these in- 
cremental steps are ignored. 


3. EXPERIMENTS: TASK DEFINITIONS 


In order to create representations of a student’s current state 
of knowledge, we chose the two following training tasks: 


• Task A: 
Based on a student’s sequence of k code submission 
attempts over time (hereafter, their “trajectory”) 
$T = [\mathrm{AST}_1, \mathrm{AST}_2, \ldots, \mathrm{AST}_k]$ on a programming exercise, 
predict at the end of the sequence whether the student will 
successfully complete or fail the next programming 
exercise within the same course. 


• Task B: 
At each $t \le k$, given a student’s sequence thus far of t 
code submission attempts $T = [\mathrm{AST}_1, \mathrm{AST}_2, \ldots, \mathrm{AST}_t]$ 
on a programming exercise, predict whether the student 
will successfully complete or fail the current programming 
exercise. 
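
As a minimal sketch of how the two tasks differ in their inputs and labels, the snippet below builds training samples from a single trajectory. The string AST identifiers, the function names, and the boolean labels are hypothetical stand-ins rather than the authors' data format, and the sketch assumes one Task B prediction for every prefix up to and including the full trajectory.

```python
from typing import List, Tuple

# Hypothetical stand-in: each submission is identified by a string AST id.
Trajectory = List[str]
Sample = Tuple[Trajectory, bool]


def task_a_sample(trajectory: Trajectory, solved_next_exercise: bool) -> Sample:
    """Task A: the full trajectory is one input, labeled with the
    student's outcome on the NEXT exercise."""
    return (list(trajectory), solved_next_exercise)


def task_b_samples(trajectory: Trajectory, solved_current_exercise: bool) -> List[Sample]:
    """Task B: every prefix of the trajectory is an input, labeled with the
    student's outcome on the CURRENT exercise."""
    return [
        (trajectory[:t], solved_current_exercise)
        for t in range(1, len(trajectory) + 1)
    ]


if __name__ == "__main__":
    traj = ["ast_17", "ast_4", "ast_0"]
    print(task_a_sample(traj, solved_next_exercise=True))
    for prefix, label in task_b_samples(traj, solved_current_exercise=True):
        print(len(prefix), label)
```

A trajectory of length k therefore yields one Task A sample and k Task B samples, which is why Task B can flag a struggling student before the trajectory ends.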


Task A is pedagogically comparable to predicting whether 
or not a student will be able to learn a new concept given the 
way they did or did not learn previous concepts. Phrased 
differently, a student who quickly (e.g. in few time steps) 
demonstrates some level of mastery of material (i.e. the 
goodness of their final submission) should be considered 
more likely to outperform a student who took a long time 
and may have struggled before eventually demonstrating the 
same level of mastery. Meanwhile, Task B is pedagogically 
comparable to detecting whether or not a student is strug- 
gling to acquire the present concept as they incrementally 
engage with the learning objective. In other words, teach- 
ers can get real-time information about the learning of the 
students. We expect Task B to be more difficult but also 
more pedagogically powerful. Task B is unlike Task A in 
that Task B does not use the full trajectory of a problem, 
which would contain post-hoc knowledge of whether or not a 
student gave up in an earlier learning interaction, for pre- 
diction. All of the students used as inputs in Task B can 
be considered not attrited at least at some point in the pre- 
diction task. Critically, success on Task B would enable 
teachers to identify at-risk students who may eventually give 
up and not complete the exercise but have not yet given up. 
In Task A, by contrast, by the time a teacher learns that a 
student has given up on one exercise (the input to Task A) 
and tries to predict their success on the next exercise, it may be 
too late to get the attrited student to rejoin the learning 
environment (e.g. re-enroll after dropping out). 


4. DATASET: HOUR OF CODE EXERCISE 


The Hour of Code course consists of twenty introductory 
programming exercises aimed at teaching beginners fundamental 
concepts in programming. Students build their programs 
in a drag-and-drop interface that pieces together blocks 
of code. The number of possible programs a student can 
write is infinite since submissions can include any number 
of block types in any combination. A student can run their 
code multiple times for any exercise. These submissions 
provide temporal snapshots to track the student’s learning 
progress. The student submission data for Exercises 4 and 
18 from this course are publicly available on code.org/research. 


For our experiments, we focus on the sequences of intermediate 
submissions on Exercise 18. We chose Exercise 18 
(over Exercise 4) because it covers multiple concepts such 
as loops, if-else statements, and nested statements, resulting 
in more complex and varied code submissions and, in turn, 
more varied trajectories of student learning. The Exercise 18 
data set contains 1,263,360 code submissions, of which 
79,553 are unique, made by 263,569 students. 81.0% of these 
students arrived at the correct solution in their last submission. 
In comparison, Exercise 4 had 1,138,506 code submissions, 
of which only 10,293 were unique. The 509,405 students who 
attempted Exercise 4 succeeded at a 97.8% rate. 


Since the Hour of Code exercises do not have a bounded 
solution space, students could produce arbitrarily long tra- 
jectories. We noted that the accuracy of student submissions 
has a high correlation with trajectory length. For instance, 
the vast majority of students with trajectory length 1 solved 
the problem with their very first submission. Hence, for 
both tasks A and B, we chose to only include trajectories 
of length 3 or above. Pedagogically, we are also more inter- 
ested in students who don’t get the answer right away, and 
we speculate that longer trajectories should roughly corre- 
late with greater struggling with the learning objective. 
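
The filtering step described above can be sketched as follows. This is an illustration only: the `(student_id, timestamp, ast_id)` tuple layout and the function name are assumptions, not the format of the published Code.org data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (student_id, timestamp, ast_id).
Submission = Tuple[str, int, str]


def build_trajectories(
    submissions: Iterable[Submission], min_length: int = 3
) -> Dict[str, List[str]]:
    """Group submissions by student, order them in time, and keep only
    trajectories with at least `min_length` submissions."""
    by_student = defaultdict(list)
    for student_id, timestamp, ast_id in submissions:
        by_student[student_id].append((timestamp, ast_id))

    trajectories = {}
    for student_id, items in by_student.items():
        items.sort(key=lambda pair: pair[0])  # chronological order
        if len(items) >= min_length:
            trajectories[student_id] = [ast_id for _, ast_id in items]
    return trajectories


if __name__ == "__main__":
    log = [("s1", 2, "b"), ("s1", 1, "a"), ("s1", 3, "c"), ("s2", 1, "a")]
    print(build_trajectories(log))  # s2 is dropped: only one submission
```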


5. MODELS 
5.1 Recurrent Neural Network Model for Student Trajectories 

Since we would like to capture aspects of a student’s learning 
behavior over time, RNNs are a suitable neural network 
architecture for our experiments, as RNNs have empirically 
performed well on sequence modeling tasks in other 
domains. For both tasks A and B, we used a Long Short-Term 
Memory (LSTM) RNN architecture, which is a popular 
extension to plain RNNs since it reduces the effect of 
vanishing gradients [8]. A student’s trajectory consists of k 
program submissions, which are represented as ASTs. Note 
that an AST contains all the information about a program 
and can be mapped back into a program. These ASTs are 
converted into program embeddings using a recursive neural 
network similar to the one described in [15]. The program 
embedding is a more compact representation of the original 
AST that captures aspects of the program, in particular 
its functionality. This sequence of program embeddings is 
fed into an RNN, as illustrated in Figure 2. 

For task A, we used a three-layer LSTM. To make 
the prediction at the end of the sequence, we pass the hidden 
state at the last timestep through a fully connected layer 
and a subsequent softmax layer. The output of the softmax 
layer is an estimated probability distribution over the two 
classes, indicating whether the student will successfully solve 
the next exercise. For task B, we built a dynamic three-layer 
LSTM, which makes a prediction at every timestep t 
based on the hidden state at t. Hence, if a student submits 
three times, we will use the sequence thus far to make three 
predictions. 
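
The difference between the two readouts can be illustrated with a small, dependency-free sketch. Note the substitutions: it uses a single plain tanh recurrent cell rather than the three-layer LSTM described above, the weights are random and untrained, and the class name, dimensions, and toy embeddings are all hypothetical. It only shows where the softmax readout is applied for each task.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TinyRecurrentClassifier:
    """Plain tanh RNN cell plus a linear+softmax readout over two classes.
    A simplified stand-in for the stacked LSTM described in the text."""

    def __init__(self, embed_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.w_in = make_matrix(hidden_dim, embed_dim, rng)
        self.w_rec = make_matrix(hidden_dim, hidden_dim, rng)
        self.w_out = make_matrix(2, hidden_dim, rng)

    def hidden_states(self, embeddings):
        """Run the cell over the sequence and return every hidden state."""
        h = [0.0] * self.hidden_dim
        states = []
        for x in embeddings:
            pre = [a + b for a, b in zip(matvec(self.w_in, x), matvec(self.w_rec, h))]
            h = [math.tanh(p) for p in pre]
            states.append(h)
        return states

    def predict_last(self, embeddings):
        """Task A style: one prediction from the final hidden state."""
        return softmax(matvec(self.w_out, self.hidden_states(embeddings)[-1]))

    def predict_each(self, embeddings):
        """Task B style: one prediction per timestep."""
        return [softmax(matvec(self.w_out, h)) for h in self.hidden_states(embeddings)]


if __name__ == "__main__":
    model = TinyRecurrentClassifier(embed_dim=3, hidden_dim=4)
    trajectory = [[0.1, 0.2, 0.3], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
    print(model.predict_last(trajectory))
    print(model.predict_each(trajectory))
```

Because the recurrent weights are shared across timesteps, the same model handles trajectories of any length; only the place where the readout is applied changes between the two tasks.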


5.2 Baselines 

Task A: The goal here is to show that our model can learn 
from the program embeddings alone whether a student is 
likely to succeed on the subsequent exercise, and to contrast 
its performance against the state-of-the-art baseline using 
handpicked features. For the baseline, we trained a logistic 
regression model on the following two features of a student’s 
trajectory T, which have been shown to be highly correlated 
with learning outcome and performance on the next exercise. 


1. The Poisson path score of the trajectory T as 
defined in [16]. Intuitively, the path score is an estimate 
of the time it will take a student to complete the trajectory 
series. The path score of a student trajectory has previously 
been related to student retention in sequential challenges 
[16]. $\mathrm{pathScore}(T) = \sum_{x \in T} \frac{1}{\lambda_x}$, where $\lambda_x$ is the number of 
times AST x appears in student submissions. 


2. Indicator feature of student success on current 
exercise 18. A student succeeded if they ended the trajec- 
tory with the solution AST. 
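
A minimal sketch of the two baseline features follows. It assumes the path score is the sum of 1/λ_x over the ASTs x in the trajectory, with λ_x counted over all student submissions; the function names and string AST identifiers are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ast_counts(trajectories: Iterable[List[str]]) -> Counter:
    """lambda_x: how many times each AST appears across all submissions."""
    counts = Counter()
    for trajectory in trajectories:
        counts.update(trajectory)
    return counts


def path_score(trajectory: List[str], counts: Counter) -> float:
    """Sum of 1 / lambda_x over the ASTs visited in the trajectory.
    Rarely seen ASTs contribute more, so unusual paths score higher."""
    return sum(1.0 / counts[ast] for ast in trajectory)


def baseline_features(
    trajectory: List[str], counts: Counter, solution_ast: str
) -> Tuple[float, float]:
    """Feature 1: path score. Feature 2: 1.0 if the trajectory ends on the
    solution AST, else 0.0."""
    solved = 1.0 if trajectory[-1] == solution_ast else 0.0
    return (path_score(trajectory, counts), solved)


if __name__ == "__main__":
    data = [["a", "b", "sol"], ["a", "sol"], ["a", "c"]]
    lam = ast_counts(data)
    for t in data:
        print(t, baseline_features(t, lam, solution_ast="sol"))
```

The resulting two-element feature vectors are what a logistic regression baseline would be trained on.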


Task B: Here, we would like to demonstrate that an LSTM 
is able to capture more information about a student’s tra- 
jectory and capture the temporal relationships within the 
sequence. Hence, we picked logistic regression on program 

Figure 2: Simplified Sequential RNN Model. For Experiment A, the model only predicts at the last timestep. For Experiment B, the 
model predicts at every timestep. Note that the RNN can be unrolled any number of times, since the parameter weights are shared across 
timesteps. Note that our models for both tasks stack multiple LSTM layers to increase expressivity. 


embeddings as our baseline. Since logistic regression cannot 
take in a sequence of embeddings, we consider every embed- 
ding within a trajectory sequence as a separate sample that 
we pass into the logistic regression model. Hence, this model 
ignores any previous temporal information; e.g. at timestep 
t, it ignores all embeddings from timestep 1 to t - 1. Note 
that this is a fairly strong baseline, since we feed in program 
embeddings which are learned using neural networks. We 
also included a random baseline as a sanity check. 
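
To make the baseline's loss of temporal context concrete, the sketch below flattens trajectories into independent samples. The function name and the toy embeddings are hypothetical; the timestep index is kept only so that metrics can later be reported per timestep, and would not be given to the classifier as a feature.

```python
from typing import List, Sequence, Tuple

Embedding = Sequence[float]


def flatten_for_baseline(
    trajectories: List[List[Embedding]], labels: List[int]
) -> List[Tuple[int, Embedding, int]]:
    """Turn each trajectory into one sample per embedding.
    Every sample carries the trajectory's label but none of the
    embeddings that preceded it."""
    samples = []
    for embeddings, label in zip(trajectories, labels):
        for timestep, embedding in enumerate(embeddings, start=1):
            samples.append((timestep, embedding, label))
    return samples


if __name__ == "__main__":
    trajs = [[(0.1, 0.2), (0.3, 0.4)], [(0.9, 0.8)]]
    for row in flatten_for_baseline(trajs, labels=[1, 0]):
        print(row)
```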


6. RESULTS 


6.1 Quantitative Results 

Task A: For both the path score baseline model and the 
LSTM model, we used 90% of the data set to perform training 
and validation and the remaining 10% for testing. The 
LSTM model consistently outperforms the path score baseline 
by around 5% on test accuracy at every trajectory length. 


This result is significant since the input we feed into the 
LSTM model consists of program embeddings, and not hand- 
picked features like success on current problem. Our model 
identified trajectories that show more promise. The abil- 
ity to understand trajectories suggests that the representa- 
tions used for the programs within the trajectories were also 
meaningful. The program embeddings were trained to pre- 
dict the output of any given student program. Our program 
embedding model was able to correctly predict the output 
for 96% of the programs in a held-out set, compared to a 54% 
accuracy from always predicting the most common output. 


Task B: We trained on trajectories of variable lengths from 5 to 
15, using 90% for training/validation and 10% for testing. 
At every timestep, we perform a binary prediction between 
two classes, “success” and “failure”. Since the “failure” 
class is pedagogically more important, we report recall, 
precision and F1 score for the “failure” class at each timestep 
for our LSTM model as well as for the logistic regression 
baseline and the random baseline (see Figure 3). We can 
observe that logistic regression on program embeddings 
appears to be a very strong baseline. This is potentially due to 
a high correlation between certain ASTs and the “success” or 
“failure” classes. Our model does particularly well on recall 
on the “failure” class, which is pedagogically more important 
than precision. In education settings, it is much worse 
to miss students who will fail than to give superfluous support 
to students who would be successful anyway. It is also 
worth noting that with an increasing number of timesteps, the 
gaps between the LSTM model and the logistic regression 
baseline on recall and F1 increase. In particular, while 
recall and precision remain roughly constant after timestep 
5 for logistic regression, recall improves significantly for 
the LSTM while precision stays roughly constant. 
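
For reference, the three metrics for the “failure” class can be computed as in the sketch below. The encoding of failure as label 0 and the function name are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def failure_class_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], failure_label: int = 0
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 with 'failure' treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == failure_label and p == failure_label)
    fp = sum(1 for t, p in pairs if t != failure_label and p == failure_label)
    fn = sum(1 for t, p in pairs if t == failure_label and p != failure_label)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


if __name__ == "__main__":
    truth = [0, 0, 0, 1, 1]
    preds = [0, 0, 1, 0, 1]
    print(failure_class_metrics(truth, preds))
```

High recall on this class means few failing students are missed, which is the property the text argues matters most; lower precision only costs some unnecessary support.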


6.2 Analysis of Trajectory Representations 
The hidden layer outputs of the trained neural net can be 
interpreted as the learned feature representations. Input 
samples that share patterns in the context of the learned 
task should ideally be mapped close to each other in the 
representation space. 


Visualizing the learned representations of a neural net is an 
empirical method to explore what the neural net has learned. 
t-distributed Stochastic Neighbor Embedding (t-SNE) [11] is 
particularly suited for visualization of high-dimensional data, as it 
can uncover structures at different scales. Figure 4 shows 
a t-SNE visualization of student trajectory embeddings for 
trajectories of length 6. We can observe five distinct clusters, 
labeled A through E, which we were also able to identify 
using the K-means clustering algorithm with the number of 
centroids set to 5. Each cluster contains trajectories sharing 
some high-level properties. Some statistics for the clusters 
are summarized in Table 1. 


Within these clusters of student trajectories, qualitative anal- 
ysis found 3 distinct learning groups. Cluster A contains 
the best students who make consistent progress, showing log- 
ical debugging steps to apply programming concepts. Each 
step fixes an existing error and moves towards the correct 
final solution. A notable differentiator for Cluster A stu- 
dents is that they did not return to sections of their solution 
that they had already corrected. This demonstrates compre- 
hension of the error and that they have digested the concept. 


Students in Clusters C and E make inconsistent progress 
and show signs of random guessing. Some students methodically 
test combinations of elements to engineer a passing 
solution. This behavior likely represents uncertain or distrusted 
knowledge. This kind of behavior is overlooked by 
the current grading system, as Code.org only considers the 
correctness of the final submission when scoring, a “number-right 
scoring” policy. The alternative is a “negative marking” 
policy, which would penalize students for wrong submissions 
along with the final answer. Educators have found that 
number-right scoring is a less reliable grading policy that 
overestimates student achievement, particularly for students 
with more distrusted knowledge, because it obscures whether 
responses represent true understanding or a lucky guess [2]. 
Anecdotally, we speculate that the students’ success in the 
next problem may come from being able to reverse-engineer 
conceptual knowledge through repeated guessing. 

Figure 4: t-SNE visualization of trajectory representations. Left: Ground truth student success for current challenge. Center: Predictions 
for next challenge given student trajectories on this challenge. Right: Ground truth student success for next challenge. 


Students in Clusters B and D appear to miss important 
concepts tested in this exercise. Students in Cluster D used 
an average of 9.21 blocks for every solution (see Table 1), al- 
most twice as many total blocks as other clusters. Rather 
than solving the challenge with one generalized program, 
they break the challenge down into segments, hardcoding 
steps to pass each segment. Students in Cluster D have the 
highest usage of move forward blocks and turn blocks since 
students rely on these simple elements rather than the more 
complex if-else and while statements, both crucial learning 
components of this challenge. An ideal solution would in- 
clude one if-else statement and one while loop. Students in 
Cluster D used the if statement only an average of 0.87 
times and the while statement 0.60 times. Inspection of their 
programs shows that students in Cluster B and Cluster D 
often disregard the while statement completely, unlike other 
clusters where students’ solutions consistently contain the 
while loop, even if used incorrectly or inefficiently. 


In summary, this analysis shows that our model can create 
more nuanced representations that lead to better predictions 
than a model that only looks at binary success indicators. 
Given that all students in Clusters B, C, D, and E performed 
poorly on the current exercise, a binary input model 
analyzing student success on Exercise 18 could not have 
distinguished between these poorly performing students. However, 
our model predicted that students in Clusters C and 
E, despite getting an incorrect answer for Exercise 18, would 
be successful on the next exercise (see Figure 4 Left and 
Figure 4 Center). Students in Cluster C went from a success 
rate of 0% in the current problem to a success rate of 48% 
in the next problem. Students in Cluster E went from 
15% to 71%. This high success rate for Clusters C and 
E is visually noticeable in Figure 4 Right. The students’ 
learning trajectories provided our model information to 
understand the students’ learning at a deeper level and make 
these nuanced predictions, validating the claim that analyzing 
student trajectories provides richer data for the model. 

Table 1: Statistics on student clusters (K-means) 

Proceedings of the 10th International Conference on Educational Data Mining 

7. CONCLUSION 


Our work focuses on multi-step exercises with unbounded so- 
lution spaces. While open-ended exercises encourage more 
flexible problem solving (e.g. in comparison to multiple- 
choice questions), understanding a student’s progress is more 
challenging due to unbounded variations in student sub- 
missions. Given that digital learning platforms can easily 
archive the temporal dimension of student submissions, we 
proposed a new approach for learning representations of stu- 
dent knowledge by using program embeddings of student 
code submissions over time instead of hand-picked features. 
We showed that the trajectories of these representations pro- 
duce distinct clusters of different student learning behaviors 
not picked up by a model that only observes binary success 
outcomes. We also showed that these representations can 
predict future student performance. We envision creating 
automated hint systems, where deep knowledge tracing has 
the potential to identify weaknesses and provide personal- 
ized feedback. By being able to anticipate student struggles 
in particular, we are in essence capturing pedagogical content 
knowledge in an unsupervised fashion. These applications 
could help improve and personalize the learning experience 
of students both in the classroom and on online education 
platforms. 
ABSTRACT 


The adoption of educational technologies such as e-textbooks has 
offered a new opportunity to gain insight into teachers' usage of 
ICT (Information and Communication Technologies). On an 
e-textbook platform, customized digital products and learning 
activities organized in a digital environment require teachers to 
make greater efforts in planning lessons and producing resources. 
In addition, usage of technology can vary greatly from one group 
of teachers to another in various contexts. In this study, we 
demonstrate how computations like event segmentation and 
contextual numbers can be exploited in visualizing trajectories of 
teachers’ ICT usage. We also study the structure of experience 
via the implicit patterns within the raw data of an e-textbook 
platform. Such automated visual characterization might be helpful 
for the wide and scalable application of teaching analytics to 
represent teachers’ ICT usage. 


Keywords 


Visual analytics; teaching analytics; contextual numbers; ICT 


1. INTRODUCTION 


Information and Communication Technologies (ICT) are 
becoming increasingly pervasive in education [12] and are making 
a difference in the ways teachers plan lessons and organize activities 
[13]. It is also well documented that teachers need support to 
make effective use of information technology in their teaching, 
because the incorporation of ICT is not an easy process and 
involves many technical complexities [10]. With the goal of better 
use of ICT, teaching analytics is conceived as an analytics 
approach that focuses on the design, development, and evaluation of 
visual analytics methods and tools for teachers [20]. 


However, the crucial step of supporting teacher interventions 
based on learning analytics insights remains under-supported [17]. 
As often happens elsewhere in learning analytics, most learning 
environments are not designed for data analysis and mining [8]. 
Even when they do perform analysis, they are designed to analyze 
student learning or behavior and provide feedback to the teacher 
[1,20], not to analyze and represent the teacher data they store. 
Therefore, many studies depict learning analytics for teachers 
rather than analytics about teaching [17]. 


In addition, although much work has been done on visualizing 
analytics results, the design and use of such visualizations are less 
understood, which can lead to weak implementations that promote 
ineffective feedback [21,19,3]. In many cases, it is not 
easy to compare complex objects in a high-dimensional 
visualization, which requires users to understand the semantics of 
the visual representation and the features assumed by the model and 
algorithm. Besides that, some visualization approaches present a 
narrow scope of representation, focusing on one snapshot of 
a certain topic of data for a certain period of time. They usually 
represent several aspects of the dataset that occur within the 
environments but do not represent the nature of connections 
inside the datasets or provide a global view of usage [2]. As a 
consequence, the application of dashboards requires additional 
information processing in various kinds of work. 


The purpose of this study is to design a computational procedure 
based on behavior data with the intent to create a trajectory 
visualization that will help describe teachers’ ICT usage. 


To explore these issues, we conduct a case study in which the data is 
gathered from an e-textbook server without any additional sensors 
or APIs. In a previous study [23], we found that a segmentation 
method is effective in distilling features for 
predicting e-textbook adoption in its early days. In this study, we 
bring together event segmentation and a one-dimensional 
self-organizing map to integrate an authentic teaching experience 
involving a digital environment with embedded robust and 
continuous characterization of ICT usage trajectories. The raw data 
records created on an e-textbook platform are 
computationally transformed and displayed, so that teachers and 
other stakeholders can use the resulting 
contextual visualization to gain insights and improve dynamic and 
diagnostic decision-making. 


2. DATA 


We investigated issues within the context of data from an e- 
textbook platform named ZoomClass. ZoomClass includes a web- 
based authoring environment and an iPad application for teachers. 
Teachers were given access to customize all digital content for 
specific teaching objectives. They typically create courses, upload 
media resources and products which are mostly customized by 
themselves in other tools (such as PowerPoint), design tasks, 
assign activities, and insert quizzes on the web-based environment 
before class. They can also record and upload photos and videos 
with the iPad application. The users of ZoomClass are teachers and 
students at a primary school in Shanghai. We obtained data on 
teacher authoring action records and student response action 
records for 110 teachers enrolled on this e-textbook platform, 
observed over more than 5 semesters since October 2014. By 
January 2017, the teachers had performed a total of 117,324 
actions, created 4,653 courses, uploaded 16,901 digital resources 
(including almost 9,000 image products), and received 3,364,533 
responses from students. 
Figure 1. The iPad application ZoomClass 

3. METHOD 


In this study, we bring together an event segmentation algorithm 
and a nonparametric mapping called contextual numbers 
to integrate an authentic teaching experience involving an e-textbook 
platform with embedded continuous characterization of ICT usage 
trajectories. In general, the intent of event segmentation is to 
determine how a threshold should be set automatically when 
partitioning action streams into usage feature spaces. The 
contextual numbers approach is used to map the high-dimensional 
space of usage to a continuous one-dimensional 
numerical field whose values are ordered in the given context, so that 
similar numbers refer to similar high-dimensional states of usage. Figure 
2 shows the computational procedure and associated steps, which 
will be discussed in detail in this section. 
Figure 2. The computational procedure and associated steps 


3.1 Event Segmentation 

In our study, data comes from the raw records of an e-textbook 
platform. This data has two characteristics: 1. Data is 
recorded only by the back-end server, without any sensor 
embedded in the front end, which means the grain size of our data is 
much coarser than that of sensor data (such as 
clickstreams); 2. Multi-platform operation, which interrupts 
data capture when a teacher switches to another 
platform. These two characteristics lead to missing 
action data in our data set. To address this issue, an 
event segmentation method is introduced to transform action 
records into an event dataset. 


Event segmentation means dividing a given number 
of observations into subsets with statistical characteristics that are 
similar within each subset and different between subsets [4]. In 
this study, the goal of event segmentation is to automatically 
partition teacher actions into separate events. The segmentation 
method is based only on the date and time information of server log 
records. We consider action records in chronological order such 
that 


$R = \{R_1, \ldots, R_m\}$ (1) 


where $R_i$ is the ith action record in data set R with length m. An 
event segment $e_{i,j}$, which is a subset of R, can be given as 


$e_{i,j} = \{R_i, \ldots, R_j\}, \quad 1 \le i \le j \le m$ (2) 


Intuitively, the time differences between inter-action records in an 
event are typically smaller than time differences between inter- 
action records from separate events, so the time intervals between 
observations are often considered as a criterion to judge 
partitioning [11]. 


Because teachers in different contexts use the e-textbook 
differently, they are likely to act at different frequencies in 
different periods. Zheng and colleagues [24] developed an analysis 
method to discover users' water usage habits. In their invention, a 
continuous event segmentation algorithm based on threshold 
optimization automatically separates the water usage records into 
individual bath events for each user. This study employed a similar 
method to create features from teachers’ action record data sets. In 
the event segmentation algorithm created by Zheng et al., a 
time-difference threshold determines whether consecutive action 
records belong to the same event. The algorithm consists of the 
following steps: 1. Compute the intervals between consecutive action 
records; 2. Compare every interval to the time-difference threshold. 
In step 2, if the interval is smaller than the threshold, the two 
action records are considered to be in the same event; if the 
interval is greater, they are assigned to two different events. The 
algorithm runs through all of the intervals, which yields the 
individual events in the action log. An automatic threshold 
optimization model was developed to search for the optimal 
threshold value for segmenting events. 
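The two steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation of Zheng et al.: the function name `segment_events`, the sample timestamps and the 10-minute threshold are assumptions, and an interval exactly equal to the threshold is treated here as starting a new event, a case the text does not specify.

```python
from datetime import datetime, timedelta

def segment_events(timestamps, threshold):
    """Group timestamps into events (hypothetical helper).

    Consecutive records whose gap is smaller than `threshold` stay in the
    same event; any other gap starts a new event.
    """
    events = []
    for ts in sorted(timestamps):
        if events and ts - events[-1][-1] < threshold:
            events[-1].append(ts)   # small gap: same event
        else:
            events.append([ts])     # first record or large gap: new event
    return events

# Made-up action log: minutes after 09:00 on an arbitrary day.
base = datetime(2017, 1, 2, 9, 0)
log = [base + timedelta(minutes=m) for m in (0, 2, 5, 40, 43, 120)]

events = segment_events(log, timedelta(minutes=10))
print([len(e) for e in events])  # [3, 2, 1]
```

With a 10-minute threshold the six records fall into three events; a larger threshold merges them, which is why the choice of threshold matters.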


The threshold optimization for each teacher in one week consists 
of the following steps: 1. Segment events with successively varying 
thresholds, with a fixed time delta d between two successive 
thresholds. We consider this threshold set in ascending order 
such that 


$TS = \{ts_1, ts_2, ts_3, \ldots\}$ (3) 


2. Compute the event number y for each threshold $ts_i$; 3. Specify 
the minimum rate of change of the event numbers for optimal 
threshold detection. In step 3, the optimization algorithm uses a 
sliding window with a fixed size. The window contains only n points, 
beginning at the current point and ending right before the next 
identified point. The optimization tries to find a starting point 
that is followed by a sequence of almost unchanged points. Suppose 
the threshold of the current point is $ts_i$; the average rate of 
change of the event numbers, cr, is defined as follows: 


$cr(ts_i) = \frac{1}{n} \sum_{j=i}^{i+n-1} \left| \frac{y_{j+1} - y_j}{ts_{j+1} - ts_j} \right|$ (4) 
The final optimal threshold can be selected from the given threshold 
set as follows: 


$ot(TS) = \arg\min_{ts_i \in TS} \, cr(ts_i)$ (5) 


Figure 3 shows an example of event segmentation with varying 
thresholds. Here, the number of events declines rapidly while the 
threshold is smaller than 10 minutes, which implies that most of the 
teacher's inter-action intervals are shorter than 10 minutes. A small 
threshold is therefore likely to split an individual event into two 
or more sub-events. It is more reasonable to choose as the threshold 
an interval value at which the number of individual events has 
touched down and leveled off at almost zero. The slopes between 
successive thresholds are used to detect this change rate. When the 
average of n consecutive inter-threshold slopes (in this case, n is 
set to 8) is closest to zero, the first threshold point in the 
sliding window is flagged as the optimal threshold value of an 
individual teacher’s inter-event interval in a week. In Figure 3, the 
26-minute point is a possible optimal threshold. 
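The threshold search described above can be sketched as follows. This is a hedged illustration under stated assumptions: the helper names `count_events` and `optimal_threshold` are hypothetical, times are plain numbers in minutes, the change rate is taken as the mean absolute slope over a window of n consecutive slopes, and ties are resolved in favour of the smallest threshold.

```python
def count_events(times, threshold):
    """Number of events when a gap >= threshold starts a new event.

    `times` is a sorted list of action times in minutes (assumed format).
    """
    if not times:
        return 0
    return 1 + sum(1 for a, b in zip(times, times[1:]) if b - a >= threshold)

def optimal_threshold(times, thresholds, n):
    """Return the first threshold of the n-slope window whose mean |slope| is smallest."""
    counts = [count_events(times, ts) for ts in thresholds]
    slopes = [abs(counts[k + 1] - counts[k]) / (thresholds[k + 1] - thresholds[k])
              for k in range(len(thresholds) - 1)]
    if len(slopes) < n:
        raise ValueError("need at least n + 1 thresholds")
    best_i, best_cr = 0, None
    for i in range(len(slopes) - n + 1):
        cr = sum(slopes[i:i + n]) / n          # average change rate in the window
        if best_cr is None or cr < best_cr:    # strict '<' keeps the earliest minimum
            best_i, best_cr = i, cr
    return thresholds[best_i]

# Made-up action times (minutes) and candidate thresholds 1..10 with delta d = 1.
times = [0, 1, 3, 4, 6, 60, 61, 63, 200, 202]
best = optimal_threshold(times, list(range(1, 11)), n=3)
print(best)  # 3
```

In this toy log the event count drops from 10 to 3 as the threshold grows from 1 to 3 minutes and then stays flat, so the window starting at 3 minutes has a zero average change rate.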
Figure 3. A sliding window that searches for an optimal 
threshold point. With n = 8, the 26-minute point is a 
possible optimal threshold. 


3.2 Creation of Features 

We applied the event segmentation algorithm to both teachers’ and 
students’ action records. The resulting segmented event dataset 
consists of 10,146 event rows derived from 117,324 teachers’ 
authoring action records, and 23,203 teaching activity rows 
derived from 3,364,533 students’ response records. For trajectory 
visualization, these events are aggregated and resampled by week 
to generate a time series data set. Eight features were distilled 
from the processed dataset: 


The total duration of producing event (DE): A producing event, 
transformed from teachers’ action data, indicates that the teacher 
created new media resources and uploaded files with the authoring 
platform. The total duration of producing events tells us how long 
teachers spend preparing their lessons on the learning platform in 
a week. 


The number of long producing events (LE): To minimize noise in 
the segmentation, we discretize events (excluding single-action 
events) into three buckets based on the quartiles of the durations 
of all events. A producing event with a duration longer than the 
upper quartile is considered a long producing event. 


The number of middle producing events (ME): A producing event 
with a duration longer than the lower quartile and shorter than 
the upper quartile is considered a middle producing event. 


The number of short producing events (SE): A producing event 
with a duration shorter than the lower quartile is considered a 
short producing event. 


The number of single-action producing events (SPE): Events with 
only one action are a special case. A single-action event can arise 
when a resource-producing action lasts for a long time without any 
neighboring action, or when only a testing action is performed. 
Therefore, we separate single-action events into two groups by 
action type. 


The number of single-action common events (SCE): An event that 
consists of a single action with no explicit relation to producing, 
such as creating a virtual folder with a default name, is considered 
a common event with a single action. 


The number of teaching activities (TA): Teaching activity in 
this study is about ‘consuming’, which provides evidence that 
teachers use the resources they uploaded to the learning platform 
before class. With event segmentation, teaching activities are 
derived from students’ concurrent response records, which include 
answer submission, media file uploading and help requests. Tasks 
assigned by teachers inside the e-textbook application are also 
counted as teaching activities even though they are mostly 
finished after class. 


The number of engaging days (ED): A day on which a teacher is 
active on the authoring platform is considered an engaging day. 
However, single-action common events are omitted when 
determining whether a teacher is active on a given day. 
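The quartile-based bucketing behind the LE, ME and SE features can be illustrated with a short sketch. The event tuples, the function name `bucket_events` and the handling of durations that fall exactly on a quartile (assigned to the middle bucket) are assumptions made for this example; the quartiles are computed over multi-action events only, which is one reading of the description above.

```python
from collections import Counter
from statistics import quantiles

def bucket_events(events):
    """Count long, middle, short and single-action events.

    `events` is a list of (number_of_actions, duration_in_minutes) tuples,
    a made-up representation for this sketch.
    """
    durations = [d for n_actions, d in events if n_actions > 1]
    q1, _, q3 = quantiles(durations, n=4, method="inclusive")
    counts = Counter()
    for n_actions, d in events:
        if n_actions == 1:
            counts["single"] += 1   # handled separately (SPE / SCE)
        elif d > q3:
            counts["long"] += 1
        elif d < q1:
            counts["short"] += 1
        else:
            counts["middle"] += 1   # includes durations equal to a quartile
    return counts

events = [(1, 0), (3, 2), (4, 4), (2, 6), (5, 8), (6, 10), (1, 0)]
counts = bucket_events(events)
print(dict(counts))
```

For these seven events the lower and upper quartiles of the multi-action durations are 4 and 8 minutes, giving one short, three middle and one long event plus two single-action events.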


3.3 Contextual Numbers 

The self-organizing map (SOM) is a nonlinear projection algorithm 
introduced by Kohonen [7]. Its earliest applications were in 
engineering tasks; the algorithm has since become a generic 
methodology that has been applied in clustering, visualization, 
data organization, characterization, and exploration [6]. A 
self-organizing map consists of organized nodes, each of which holds 
an N-dimensional weight vector. Given observations 
$X = \{x_1, x_2, \ldots, x_n\}$ in N-dimensional space, $x_i \in \mathbb{R}^N$, 
the procedure can be summarized in three processes: competition, 
cooperation and self-adaptation. The SOM training algorithm can 
be thought of as a net that is spread over the data cloud. In 
general, it moves the weight vectors so that they span the data 
cloud and neighboring nodes get similar weight vectors [7]. 


Traditionally, most applications of the SOM algorithm were 
organized in a two-dimensional coordinate system (such as [2], 
[18]). In these applications, after the data are projected onto the 
SOM grid, the node indexes, taken as single values, create a new 
contextual order that can be used to transform each 
high-dimensional point into a new computational space. Points that 
are close are similar in this context; however, this similarity 
cannot be interpreted along a single-dimensional axis in the way 
that the classic number space can [15]. 


In this regard, a one-dimensional SOM called contextual numbers 
was introduced by Moosavi [14]. This method can be seen as a 
sequence of ordered numbers pointing to a high-dimensional 
space; the numbers are ordered according to their similarities 
within the selected high-dimensional state space or context. In 
contextual numbers, K nodes are produced in one dimension 
after the SOM is mapped with X, and each node, with its attached 
high-dimensional weight vector, represents the original 
information. Instead of using the values within the nodes, a series 
of new contextual orders is created. This can be summarized in the 
following steps: 1) Calculate the posterior probability of assigning 
each contextual number; 2) Select as the node index the number at 
which the posterior probability reaches its dominant peak. 
The difference between the two-dimensional and the 
one-dimensional case is reflected in the relation between the 
indices and the weight vectors. In a two-dimensional grid, the 
neighborhood similarity expands in two directions, so there is no 
direct correlation between the numerical values of the indices and 
the similarity of their weight vectors. In a one-dimensional grid, 
however, a valuable property of contextual numbers is that there is 
a direct correlation between the indices and the similarity of 
their weight vectors [14]. In most two-dimensional cases, the final 
index of the trained SOM is not used directly as a numerical value; 
the assigned weight vector is used instead. Contextual numbers, by 
contrast, allow us to create a continuous number space converted 
from a high-dimensional space, which fits completely into a 
univariate space [15]. For the usage time series analysis in this 
study, converting to contextual numbers gives us a univariate 
weekly usage time series for each teacher. 


It should be noted that the indices we map to each node are not 
classic numbers. The values of these indices do not represent 
performance grades, but the similarity of two or more nodes. If 
two indices have close values (e.g. node number 1 and node 
number 2 in the SOM network; numbering is arbitrary, but we 
usually start from the upper-left corner and go row by row), they 
are similar in this context [15]. 
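To make the idea of a one-dimensional SOM concrete, the following sketch trains a chain of K nodes and maps each observation to the index of its best-matching node. It is a simplified illustration, not Moosavi's method: the function names, the linear decay schedules, the Gaussian neighborhood and the use of the best-matching node index in place of the posterior-probability step are all assumptions.

```python
import math
import random

def best_matching_unit(weights, x):
    """Competition: index of the node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))

def train_som_1d(data, k, epochs=50, lr0=0.5, seed=0):
    """Train a one-dimensional SOM with k nodes (hypothetical helper)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(k)]
    sigma0 = k / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                     # decaying learning rate
        sigma = max(sigma0 * (1 - frac), 0.5)     # shrinking neighborhood
        for x in rng.sample(data, len(data)):
            bmu = best_matching_unit(weights, x)
            for j in range(k):
                # Cooperation: nodes near the winner on the chain move more.
                h = math.exp(-((j - bmu) ** 2) / (2 * sigma ** 2))
                # Self-adaptation: pull the weight vector toward x.
                for d in range(dim):
                    weights[j][d] += lr * h * (x[d] - weights[j][d])
    return weights

# Made-up two-feature observations forming two loose groups.
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
weights = train_som_1d(data, k=5)
indices = [best_matching_unit(weights, x) for x in data]
print(indices)
```

Each observation is reduced to a single integer between 0 and K - 1, so a sequence of weekly feature vectors becomes a univariate series of node indices.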


3.4 The Second-Stage Clustering 

With the indices (contextual numbers), hierarchical clustering is 
performed in this part. One advantage of hierarchical clustering 
algorithms is that they help with the interpretation of the results 
by creating meaningful taxonomies. Because these numbers carry 
contextual information that is difficult to interpret, a common 
two-stage clustering is employed to combine the most similar 
indexes, as previous applications did with the nodes of 
two-dimensional SOM grids (such as [22, 16]). A typology is then 
developed from the clustering results, which has also been shown to 
make the data more accessible when stakeholders are involved in 
exploring it by visual inspection [5]. 


To obtain good clustering performance, we first employ k-means 
and an intrinsic metric, the within-cluster sum of squared errors 
(SSE), to compare the performance of different numbers of 
clusters. Based on the within-cluster SSE, the elbow method is 
used to estimate the optimal number of clusters k for a given 
task. In this study, the elbow is located at k = 5, so we choose it 
as the number of clusters. Finally, we perform hierarchical 
clustering on the contextual numbers. 


4. RESULT 


This section presents the two stages of our research. In the first 
part, the high-dimensional observations from the processed time 
series data are converted to the corresponding contextual numbers, 
a series of continuous indices, and a specific typology is built 
for interpretation. In the second part, we apply this to produce 
visualizations of teachers' ICT use trajectories. 


4.1 Usage Typology 


First, a SOM is trained on a one-dimensional network with the 
eight-dimensional usage data, and the range of indexes is set from 
0 to 29. Each index node therefore has two neighbors, except the 
first and the last. We then apply the second-stage clustering to 
discretize the contextual number indexes into groups for 
interpretation, and five groups are identified in our study. The 
details of the groups are shown in Table 1. 


As can be seen in Table 1, Group A characterizes the Limited use 
pattern. Teachers in this group spend very little time using the 
authoring platform. Having few producing events indicates that 
they almost never upload media resources. Meanwhile, they organize 
a few activities about once a week, which shows that the technology 
is seldom used in their classes. The usage of this group usually 
occurs at the beginning or the end of semesters. 


Group B characterizes the Early use pattern. The teachers in this 
group organize even fewer activities than the teachers in Group A, 
but they have at least one middle or short producing event a week, 
which means that some resources were produced to prepare for 
lessons: they are trying to use the platform to prepare lessons. We 
find that the usage of this group is the mainstream during the 
first three semesters. 


Group C characterizes the Consuming use pattern. Teachers in this 
group use the learning platform more frequently than those in 
Group A and Group B. They are very willing to use the application 
to organize teaching activities and usually have plenty of 
responses on the e-textbook, but they produce at most once a week. 
They also have more single-action common events than teachers in 
the other groups, since they tend to consume resources rather than 
produce them. 


Group D characterizes the Moderate use pattern. The teachers in 
this group begin to produce resources on the platform frequently; 
many of them use the learning platform on three out of five 
working days every week. Compared to the three groups mentioned 
above, teachers in this group are actually using the technology to 
plan lessons with resources that they build themselves. As they 
produce frequently, they have the highest number of middle events. 
Compared to teachers in Group C, however, they have fewer 
activities, which means they do not rely on the e-textbook to 
teach in class as teachers in Group C do. 


Group E characterizes the Intensive use pattern. The teachers in 
this group usually produce resources heavily over long periods, and 
they produce many resources on the platform. They produce on 
almost every one of the five working days each week, and they also 
organize numerous activities, which means they actually use the 
application a lot in class. 


Therefore, we can build meaningful names and stories for every 
group and create fictitious typology labels for the contextual 
number indexes, in order to provide an easy way to understand the 
contextual meaning of the indexes. As shown in Table 2, we 
summarize each group, giving its key characteristics and the 
indexes that belong to it. 
Table 1. Grouping results showing the mean value for each 
feature and cluster 


| Group | A | B | C | D | E | 
| DE | 0.25 
| LE | 0.000 | 0.285 | 0.288 | 0.522 | 2.156 | 
| ME | 0.000 | 0.692 | 0.742 | 2.597 | 2.012 | 
| SE | 0.029 | 0.371 | 1.000 | 0.827 | 0.514 | 
| SPE | 0.206 | 0.432 | 0.327 | 0.931 | 0.452 | 
| SCE | 0.531 | 0.532 | 2.336 | 1.743 | 2.218 | 
| TA | 6.396 | 2.883 | 21.107 | 8.560 | 32.863 | 
| ED | 0.025 | 1.065 | 1.408 | 2.866 | 4.174 | 


Table 2. The user typology derived from two-stage clustering 


Typology Label 


Few teaching activities 
Producing at most once a week 
More independent actions 
Frequently producing 
Highest middle-event rates 
Slightly less activities 
Almost producing everyday 
Organizing numerous activities 
Figure 4. Sample trajectory visualization 


4.2 Usage Trajectory 


Finally, we can explore the visual trajectories with the typology 
to identify implicit patterns and hypotheses. This visualization 
provides the capability to trace states and discover patterns 
without reducing the information to simple statistics. It 
illustrates the teachers' usage trajectories, which is helpful 
because teachers and stakeholders rarely trace the process of how 
they use ICT in teaching. 


As shown in Figure 4, the Y axis is the index of the 
one-dimensional SOM and the X axis shows the week, which is the 
unit of time observed in this case. The figure shows the states and 
trajectories of each teacher over time. Similar teachers therefore 
have similar index numbers over time. This allows us to identify 
how a teacher uses the technology by comparing the trajectory and 
pattern of each teacher with those of the others using the 
contextual numbers of the SOM. If we are familiar with a few 
teachers' usage, we can treat these teachers as contexts for 
relative positioning when identifying a new teacher's usage, even 
if we do not know the interpretation of the contextual numbers. As 
shown in Figure 4, we can take Teacher 3 as a template if we are 
familiar with his or her usage or performance; the usage of 
Teacher 4 is then easy to identify by comparing their similar 
trajectories. Our statistical analysis of the index set shows that 
Teacher 3 and Teacher 4 have the lowest Euclidean distance. We can 
also automatically find similar teachers by calculating the 
distance between trajectories. 
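The distance-based comparison of trajectories can be sketched as follows. The four index sequences are invented for illustration and are not the values behind Figure 4; the function name `closest_pair` is likewise an assumption, and trajectories are assumed to have equal length.

```python
import math
from itertools import combinations

def closest_pair(trajectories):
    """Return the pair of teachers whose index trajectories are closest.

    `trajectories` maps a teacher name to a list of weekly contextual
    numbers of equal length (hypothetical data layout).
    """
    return min(
        combinations(sorted(trajectories), 2),
        key=lambda pair: math.dist(trajectories[pair[0]], trajectories[pair[1]]),
    )

# Invented weekly contextual numbers for four teachers.
trajectories = {
    "Teacher 1": [3, 5, 8, 6],
    "Teacher 2": [4, 6, 9, 5],
    "Teacher 3": [20, 29, 16, 27],
    "Teacher 4": [21, 28, 17, 27],
}

pair = closest_pair(trajectories)
print(pair)  # ('Teacher 3', 'Teacher 4')
```

Because the contextual numbers are ordered by similarity, a plain Euclidean distance between two index sequences is a usable first measure of how alike two usage trajectories are.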


As the use of an "adopted" technology can vary greatly from one 
group of teachers to another [9], this figure provides an easy way 
to partition the teachers in terms of the variations along the two 
dimensions of contextual index and time of usage. In this case, in 
the intense user group (Teacher 3 and Teacher 4), the contextual 
number indexes mostly ranged from 10 to 29, almost consistently 
higher than the indexes of the moderate user group (Teacher 1 and 
Teacher 2), whose usage was mostly labeled as early use or 
consuming use in the first three semesters. Apparently, Teacher 1 
and Teacher 2 adopted this tool for teaching, but did not rely on 
it in the same way that Teacher 3 and Teacher 4 did. However, it is 
not reasonable to evaluate the performance of teachers’ ICT use by 
the index number: the SOM indexes are computable numbers that 
represent the state based on the contexts, but their values do not 
follow the concept of natural numbers, which can be interpreted as 
ordered grades. Therefore, a higher index does not always indicate 
better performance, even though higher contextual number indexes 
happen to be labeled with more intensive use in this case. 


This method can also reveal potential patterns in the trajectories 
of contextual numbers. As shown in Figure 4, the state of a 
teacher's usage fluctuates visibly over each semester. More 
specifically, consider Teacher 4 in the last semester: at the 
beginning of this semester the state number stands at a limited 
use index. Then the number shoots up over the next two weeks, 
peaking at 29, which means a state of intensive use. After that, the 
contextual number declines rapidly for two weeks, bottoming out 
at 16, which is labeled as a consuming use index. The next week 
sees a very sharp rise, reaching the intensive use area again. 
According to the usage indexes in the following weeks, a total of 5 
peaks can be detected. The peak pattern discovered from trajectory 
plotting describes a behavior in which a teacher tends to produce 
teaching resources intensively in the first week, then consume them 
in that week and the following one to two weeks. We applied 
frequent sequence mining to the segmented trajectory data of 
active teachers to explore this idea. The result shows that the 
peak-pattern sequences (such as the sequence [Consuming use, 
Intensive use, Consuming use]) all have the highest frequency 
among sequences of the same week length. 
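A simple way to look for such peak patterns is to count fixed-length label sequences across teachers. The sketch below is an assumption-laden stand-in for the frequent sequence mining mentioned above: it counts only contiguous windows, and the label sequences and the function name `frequent_sequences` are made up for the example.

```python
from collections import Counter

def frequent_sequences(label_sequences, length):
    """Count every contiguous window of `length` weekly labels."""
    counts = Counter()
    for labels in label_sequences:
        for i in range(len(labels) - length + 1):
            counts[tuple(labels[i:i + length])] += 1
    return counts

# Invented weekly typology labels for two teachers.
label_sequences = [
    ["Consuming", "Intensive", "Consuming", "Intensive", "Consuming", "Limited"],
    ["Early", "Consuming", "Intensive", "Consuming", "Moderate"],
]

counts = frequent_sequences(label_sequences, length=3)
top_sequence, top_count = counts.most_common(1)[0]
print(top_sequence, top_count)  # ('Consuming', 'Intensive', 'Consuming') 3
```

Counting is done separately for each window length, which matches the idea of comparing sequences only within the group of their own week length.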


5. CONCLUSION 


This paper introduced a computational procedure for visualizing 
the trajectories of teachers’ ICT usage, based on the 
resource-producing process and the experience structure, by 
extracting the implicit patterns within the raw data through event 
segmentation and contextual numbers. The resulting visualization 
provides the capability to trace states and discover patterns 
without reducing the information to simple statistics. Such 
automated visual characterization might be helpful for the wide 
and scalable application of teaching analytics to represent 
teachers’ ICT usage. Our future work will be oriented toward 
spatiotemporal dynamics in education, especially the application 
of ICT, in which knowledge extraction from web-based education 
systems can be viewed as a formative evaluation technique. In this 
setting, high-dimensional time series with different features can 
be replaced by a series of contextual numbers, and these numbers 
can be embedded in any data-driven analysis and prediction [14]. 
ABSTRACT 

This paper attempts to model network dynamics of MOOC 
discussion interactions. It contributes to providing alternatives 
to conducting null hypothesis significance testing in 
educational studies. Using data collected from two successive 
psychology MOOCs in 2014 and 2015, the probabilistic 
longitudinal network analysis was performed by employing 
stochastic actor-based models with statistical accuracy. 
Understanding the mechanisms that drive the dynamics of 
discussions sheds light on the design of a self-generated and 
learner-supported learning environment to meet the challenges 
of accommodating a massive and global student body. 
learning environment to meet the challenges of 
accommodating a massive and global student body. 


Using data collected from two successive psychology MOOCs 
in 2014 and 2015 and applying probabilistic longitudinal 
network analysis, this study seeks to rigorously measure the 
dynamic mechanisms that drive discussion change over time. 
The probabilistic analysis was performed by employing 
stochastic actor-based models with statistical accuracy. 


METHODS 

The probabilistic longitudinal network analysis was performed 
by employing stochastic actor-based models defined and 
evaluated with the program Simulation Investigation for 
Empirical Network Analysis (SIENA). Four hypotheses are proposed 
to test the network dynamics of MOOC discussions. 


Hypothesis 1 (H1): There is a tendency towards reciprocation 
in the studied discussion networks (i→j and j→i). (Dyadic Level) 


Hypothesis 2 (H2): There is a tendency towards transitivity 
(i.e. increasing transitivity and reducing distance between 
actors; i→j, j→k and i→k). (Triadic Level) 


Hypothesis 3 (H3): There is a tendency towards the increasing 
volume of interactions between learners themselves. 


Hypothesis 4 (H4): There is a tendency towards preferential 
attachment within the studied networks. 


PRELIMINARY RESULTS 


Descriptive statistics of the discussion network 

In the 2014 MOOC, 1915 participants posted 5251 messages in 
total, of which 217 were threads and 5034 were replies and 
comments, while in the 2015 psychology MOOC, 962 threads were 
posted, along with 3097 replies and comments. 


In the 2014 Psychology MOOC, there are topics initiated by TAs 
to collect feedback for individual sections and to answer 
content-related questions for each section. As shown in Figure 1, 
the number of postings in the discussion categories initiated by 
TAs is relatively larger than the number in the same topics 
initiated by learners themselves. The category “content-related 
Q&A initiated by TAs for individual sessions” seems to attract a 
good number of replies and comments over time. Interestingly, as 
shown in Figure 1, the discussions of exercises share a similar 
quantitative pattern with content-related discussions, while the 
enquiries about the logistics of the course follow a similar 
pattern to technical discussions in both offerings of the 
psychology MOOC. In the 2015 Psychology MOOC, technical problems 
occurred during the mid-examination, which shows as a peak in 
Figure 1. 


Network Dynamics 

Tables 2 and 3 present the results of the SIENA estimation. As 
shown in Tables 2 and 3, the results of Model 0 (network 
effects: reciprocity; transitivity) indicate a tendency for 
participants to create mutual relationships at both dyadic and 
triadic levels, which leads to cohesiveness in the studied 
networks. This confirms hypotheses H1 and H2. The exceptional 
case is the transitivity effect identified in the category of 
“feedback” (i.e. general feedback to instructors and TAs initiated 
by learners), where there is no tendency for participants to 
create mutual relationships at the triadic level. This deserves a 
detailed examination in future analysis. Interestingly, under the 
topic categories of “feedback” and “TA about” (i.e. enquiries 
about the logistics of the course initiated by TAs), when same 
role is used as a control variable, the transitivity effect is 
significant with a negative coefficient. Compared to discussions 
in other categories, cohesive subgroups are less likely to form 
when learners provide feedback on the course and enquire about 
course logistics. 


Figure 1. The number of postings within different discussion topics over time (2014 left & 2015 right). Axes: course week (x), number of posts (y). Legend categories: about, content, feedback, others, TA about, TA feedback, TA Q&A, TA welcome, technology. 


In both courses, same role is a significant covariate effect 
with a negative coefficient. Thus, H3 (Model 1: reciprocity; 
transitivity; same role) is rejected, indicating that there is no 
tendency towards an increasing volume of interactions 
between learners. 


H4 (Model 2: reciprocity; transitivity; activity of alter) states 
that there is a tendency towards preferential attachment 
within the studied networks. The preferential attachment 
effect is not consistent among discussions of different topics. 
In most discussions, there is a tendency for participants who 
are actively involved in forum discussions in the early stages 
to become even more engaged over time. Nevertheless, when 
exercises are discussed in the 2014 Psychology MOOC, there is no 
preferential attachment effect, which deserves future 
examination. 
errors in parentheses (2014 Psychology) 
ABSTRACT 


Research on learner behaviors and course completion within 
Massive Open Online Courses (MOOCs) has been mostly confined 
to single courses, making the findings difficult to generalize across 
different data sets and to assess which contexts and types of courses 
these findings apply to. This paper reports on the development of 
the MOOC Replication Framework (MORF), a framework that 
facilitates the replication of previously published findings across 
multiple data sets and the seamless integration of new findings as 
new research is conducted or new hypotheses are generated. MORF 
enables larger-scale analysis of MOOC research questions than 
previously feasible, and enables researchers around the world to 
conduct analyses on huge multi-MOOC data sets without having to 
negotiate access to data. 


Keywords 
MOOC, MORF, replication, meta-analysis. 


1. INTRODUCTION 


Massive Open Online Courses (MOOCs) have created new 
opportunities to study learning at scale, with millions of users 
registered, thousands of courses offered, and billions of student- 
platform interactions [1]. Both the popularity of MOOCs among 
students [2] and their benefits to those who complete them [3] 
suggest that MOOCs present a new, easily scalable, and easily 
accessible opportunity for learning. A major criticism of MOOC 
platforms, however, is their frequently high attrition rates [4], with 
only 10% or fewer learners completing many popular MOOC 
courses [1, 5]. As such, a majority of research on MOOCs in the 
past 3 years has been geared towards increasing student 
completion. Researchers have investigated features of individual 
courses, universities, platforms, and students [2] as possible 
explanations of why students complete or fail to complete. 


A majority of this research, however, has been limited to single 
courses, often taught by the researchers themselves, which is due 
in large part to the lack of access to other data. In order to increase 
access to data and make analysis easier, researchers at UC Berkeley 
developed an open-source repository and analytics tool for MOOC 
data [6]. Their tool allows for the implementation of several 
analytic models, facilitating the re-use and replication of an 
analysis in a new MOOC. 


Running analyses on single data sets, however, still limits the 
generalizability of findings, and leads to inconsistency between 
published reports [7]. In the context of MOOCs, for example, one 
study investigated the possibility of predicting course completion 
based on forum posting behavior in a 3D graphics course [8]. They 
found that starting threads more frequently than average was 
predictive of completion. Another study investigating the 
relationship between forum posting behaviors, confusion, and 
completion in two courses on Algebra and Microeconomics found 
the opposite to be true; participants that started threads more 
frequently were less likely to complete [9]. 


The limited scope of much of the current research within 
MOOCs has led to several contradictory findings of this nature, 
duplicating the “crisis of replication” seen in the social psychology 
community [10]. The ability to determine which findings 
generalize across MOOCs, and in what contexts findings stabilize, 
will lead to knowledge that can more effectively drive the design 
of MOOCs and enhance practical outcomes for learners. 


2. MORF: GOALS AND ARCHITECTURE 
To address this limitation, we have developed MORF, the MOOC 
Replication Framework, a framework for investigating research 
questions in MOOCs within data from multiple MOOC data sets. 
Our goal is to determine which relationships (particularly, 
previously published findings) hold across different courses and 
iterations of those courses, and which findings are unique to 
specific kinds of courses and/or kinds of participants. In our first 
report of MORF [11], we discussed the MORF architecture and 
attempted to replicate 21 published findings in the context of a 
single MOOC. 


MORF represents findings as production rules, a simple formalism 
previously used in work to develop human-understandable 
computational theory in psychology and education [14]. This 
approach allows findings to be represented in a fashion that human 
researchers and practitioners can easily understand, but which can 
be parametrically adapted to different contexts, where slightly 
different variations of the same findings may hold. 


The production rule system was built using Jess, an expert system 
programming language [15]. All findings were programmed into 
if-else production rules following the format, “If a student who is 
<attribute> does <operator>, then <outcome>.” Attributes are 
pieces of information about a student, such as whether a student 
reports a certain goal on a pre-course questionnaire. Operators are 
actions a student does within the MOOC. Outcomes are, in the case 
of the current study, whether or not the student in question 
completed the MOOC (but could represent other outcomes, such as 
watching more than half of the videos). Not all production rules 
need to have both attributes and operators. For example, production 
rules that look at time spent in specific course pages may have only 
operators (e.g., spending more time in the forums than the average 
student) and outcomes (i.e., whether or not the participant 
completed the MOOC). 


Each production rule returns two counts: 1) the confidence [16], or 
the number of participants who fit the rule, i.e., meet both the if 
and the then statements, and 2) the conviction [17], the production 
rule’s counterfactual, i.e., the number of participants who match the 
rule’s then statement but not the rule’s if statement. For example, 
in the production rule, “If a student posts more frequently to the 
discussion forum than the average student, then they are more likely 
to complete the MOOC,” the two counts returned are the number 
of participants that posted more than the average student and 
completed the MOOC, and the number of participants who posted 
less than the average, but still completed the MOOC. As a result, 
for each MOOC, a confidence and a conviction for each production 
rule can be generated. 
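The two counts described above can be sketched as follows. This is a minimal illustration in Python rather than Jess; the `Student` record, its field names, and the toy data are hypothetical and are not taken from MORF.

```python
from dataclasses import dataclass


@dataclass
class Student:
    forum_posts: int   # hypothetical operator feature
    completed: bool    # outcome: did the student complete the MOOC?


def rule_counts(students, if_part, then_part):
    """Return the two counts the text describes for one production rule."""
    # Participants who meet both the if and the then statements.
    fit = sum(1 for s in students if if_part(s) and then_part(s))
    # Participants who meet the then statement but not the if statement.
    counterfactual = sum(1 for s in students if not if_part(s) and then_part(s))
    return fit, counterfactual


# Toy data (invented for illustration only).
students = [Student(12, True), Student(1, False), Student(9, True),
            Student(0, True), Student(2, False)]
mean_posts = sum(s.forum_posts for s in students) / len(students)

confidence, conviction = rule_counts(
    students,
    if_part=lambda s: s.forum_posts > mean_posts,
    then_part=lambda s: s.completed,
)
```

Rules without an attribute or without an operator fit the same shape: the `if_part` predicate simply tests fewer conditions.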


A chi-square test of independence can then be calculated comparing 
each confidence to each conviction. The chi-square test can 
determine whether the two values are significantly different from 
each other, and in doing so, determine whether the production rule 
or its counterfactual significantly generalized to the data set. Odds 
ratio and risk ratio effect sizes per production rule are also 
calculated. Stouffer’s [18] Z-score method can be used in order to 
combine the results per finding across multiple MOOC data sets, to 
obtain a single statistical significance. 
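A minimal sketch of these statistics, using only the Python standard library, is given below. It assumes a 2x2 contingency table per production rule (matches the if statement or not, completed or not), a Pearson chi-square without continuity correction, and one-sided p-values as input to Stouffer's method; the source does not specify these details, and the function names are illustrative.

```python
import math
from statistics import NormalDist


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]] (1 degree of freedom).

    a: matches if, completed       b: matches if, did not complete
    c: does not match, completed   d: does not match, did not complete
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(stat / 2))  # chi-square survival function, df = 1
    return stat, p


def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)


def risk_ratio(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))


def stouffer(p_values):
    """Combine one-sided p-values from k data sets into a single Z and p."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1 - p) for p in p_values) / math.sqrt(len(p_values))
    return z, 1 - nd.cdf(z)


# Invented counts for one rule in one data set.
stat, p = chi_square_2x2(30, 70, 10, 90)
# Invented per-data-set p-values for the same rule.
z_combined, p_combined = stouffer([0.05, 0.05])
```

The unweighted form of Stouffer's method is used here; weighting each data set by its size is a common variant that this sketch does not cover.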


Currently, 40 MOOC data sets and 21 production rules related to 
pre-course survey responses, time spent in course pages, forum 
posting behaviors, forum post linguistic features, and completion 
are incorporated in the framework. 


3. FUTURE WORK 


First, we plan to expand the current set of variables being modeled 
in MORF, both in terms of predictor (independent) variables and 
outcome (dependent) variables. This will enable us to replicate a 
broader range of published findings. Our first efforts do not yet 
include findings involving data from performance on assignments 
or behavior during video-watching, two essential activities in 
MOOCs. 


Second, we intend to add to MORF a characterization of the 
features of the MOOCs themselves, towards studying whether 
some findings fail to replicate in specific MOOCs due to the 
differences in design, domain, or audience between MOOCs. 
Understanding how the features of a MOOC itself can explain 
differences in which results replicate may help us to explain some 
of the contradictory findings previously reported in single-MOOC 
research. Doing so will help us to understand which findings apply 
in which contexts, and how differences in the design of MOOCs 
drive differences in the factors associated with student success. 


ABSTRACT 


Few attempts have been made to create student models that cluster 
student and school level traits as a means to design personalized 
learning interventions. In the present work, data from 
ASSISTments was enriched with publicly available school level 
data and K-Means clustering was employed. Results revealed the 
importance of school locale, measures of district wealth, and 
system interaction patterns as potential foci for personalization. 
Clusters were then applied to a test set of held out data and cluster 
assignments were used to help predict end-of-year standardized 
mathematics test scores. Findings suggest that while cluster 
interpretations were not generalizable to held out data, clustering 
was generally helpful in predicting standardized test scores. 


Keywords 
K-Means Clustering, Student-System Interactions, School Level 
Characteristics, Standardized Tests, Ensembled Prediction Model. 


1. INTRODUCTION 


The focus of research using vast educational data often lends itself 
to the development of learner models, or various sophisticated 
predictive models that help to pinpoint when and how learning 
occurs on a personalized level. Popular approaches include 
Bayesian Networks (i.e., Bayesian Knowledge Tracing) [3], 
Performance Factors Analysis [6], and Neural Networks (i.e., 
Deep Learning) [4]. However, it is valuable to ask if simpler 
models built to leverage student, school, and district level data can 
be useful in establishing learner profiles. 


The use of clustering to group similar students within various 
types of online learning environments has typically been a 
successful endeavor [1, 2, 7, 8]. The present work seeks to 
balance the complexity of working with high volumes of 
educational data and building simple predictive learner models 
through clustering by answering the following research questions: 

1. Are there distinct types of learners within ASSISTments [5] 
that can be identified by clustering student, school, and 
district level characteristics and measures of student/system 
interaction? 

2. What student types are defined via cluster interpretation? Do 
interpretations generalize to unseen data? 

3. Can clusters help predict significant differences in end-of- 
year test scores? 


2. METHODOLOGY 


The present work assessed log files from students in the state of 
Maine working in ASSISTments [5], an online learning system 
focused on middle school mathematics, during the 2014-2015 
academic year. This data was extended by merging additional 
school and district level data from the Common Core of Data 
supported by the NCES and IES (https://nces.ed.gov/ccd/). 
Students’ scores on the standardized, end-of-year TerraNova 
mathematics test were also included in the dataset. 


For each student, the dataset contained averages for the following 
student/system interaction features: problem count, time spent on 
problems, percent correct across assignments, hints used per 
problem, number of problems per assignment for which hints 
were used, and assignment completion rate. Additionally, each 
student’s data included continuous measures retrieved from the 
NCES/IES data (i.e., the percentage of students in the school 
eligible for free or reduced lunch) as well as one-hot encoded 
forms of categorical features like school locale. The cleaned 
dataset represented 1,557 unique students from 21 schools, with 
171,983 unique student/assignment pairs stemming from 35,127 
assignments. Each observation or row represented the overall 
performance and characteristics of a single student and their 
school or district. De-identified data is available at 
tiny.cc/EDM2017Clustering for further reference. 


The modeling approach used in the present work was adapted 
from that in [1]. An initial 70% of the data was randomly selected 
to form the training set. The training set was used for initial K- 
Means clustering and cluster interpretation. The K-Means 
algorithm was sourced from R’s statistics package, implementing 
Euclidean distance as the default distance measure. The remaining 
30% of the data was used to form the test set. The test set was 
used to build models predicting TerraNova scores. First, 
predictions were made to assign students in the test set to a 
cluster. Following student assignment, clusters were reinterpreted 
to verify whether trained interpretations generalized to unseen 
data. Cluster membership was then used to help predict 
TerraNova scores alongside student-system interaction features 
using cluster-specific stepwise linear regressions. These 
regression models were then ensembled and measures of model 
accuracy were compared to a traditional approach where K = 1. 
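The test-set procedure above (assign each held-out student to a trained cluster, then predict with that cluster's own regression) can be sketched as follows. This is a simplified illustration in Python, not the R pipeline used in the study: it swaps the stepwise multiple regression for a single-predictor least-squares line, and the centroids, features, and scores are invented.

```python
import math


def assign_cluster(row, centroids):
    """Index of the nearest centroid under Euclidean distance."""
    return min(range(len(centroids)), key=lambda k: math.dist(row, centroids[k]))


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def fit_ensemble(rows, scores, centroids, feature=0):
    """Fit one regression per cluster; the ensemble is the dict of models."""
    groups = {}
    for row, score in zip(rows, scores):
        k = assign_cluster(row, centroids)
        groups.setdefault(k, []).append((row[feature], score))
    return {k: fit_line(*zip(*pairs)) for k, pairs in groups.items()}


def predict(row, centroids, models, feature=0):
    """Route the student to a cluster, then apply that cluster's model."""
    intercept, slope = models[assign_cluster(row, centroids)]
    return intercept + slope * row[feature]


# Invented data: (percent correct, hints per problem) and a test score.
centroids = [(0.2, 1.0), (0.8, 5.0)]
rows = [(0.1, 1.0), (0.3, 1.2), (0.7, 5.1), (0.9, 4.8)]
scores = [600, 620, 700, 710]
models = fit_ensemble(rows, scores, centroids)
```

Setting `centroids` to a single point reproduces the K = 1 baseline, since every student is then routed to the same model.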


3. TRAINING 


In order to determine the optimal value for K, 10-fold cross 
validation was implemented on the training set to build scree 
plots. To determine the most appropriate value from this set, the 
mean and median of optimal K values across folds were 
considered (M = 4.1, Med. = 4). As such, four clusters were 
forced using K-Means on the training data. The four resulting 
clusters were characteristic of unique types of students, ultimately 
labeled as “proficient,” “struggling,” “learning,” and “gaming.” 
Graphics and additional information on cluster characteristics are 
available at tiny.cc/EDM2017Clustering for further reference. 


Table 1. Coefficients, Standard Errors, and Model Statistics per cluster on test set data when K=1 and K=4. 

                 K=1                   K=4
                 1 (n=442)             1 (n=127)             2 (n=160)             3 (n=124)             4 (n=31)
IVs              b           SE        b           SE        b           SE        b           SE        b           SE
Intercept        631.94***   20.37     [illegible] 51.78     504.41***   30.36     567.63***   34.66     680.14***   63.13
Percent Correct  110.66***   22.76     81.95       61.70     268.30***   33.92     131.02***   35.16     18.73       68.74
Avg. Time        -0.08**     0.03      -0.10       0.07      0.01        0.04      0.09        0.06      -0.09       0.09
Completed        0.35        12.13     -63.05      39.89     8.47        15.55     22.10       18.68     -18.80      34.25
Total Hints      1.73        2.69      7.01        6.08      8.00*       3.66      -38.84***   8.02      -52.73*     19.80
Hint Instances   -0.11       3.53      -9.34       11.25     [illegible] 4.13      49.75***    9.68      71.09**     24.23

Model Stats
F (DF)           17.55*** (5, 436)     1.30 (5, 121)         22.87*** (5, 154)     8.18*** (5, 118)      2.00 (5, 25)
R² (Adj. R²)     0.168 (0.158)         0.051 (0.012)         0.426 (0.408)         0.257 (0.226)         0.286 (0.143)

4. TESTING & MODEL EVALUATION 
Using the remaining 30% of the data that had been held out from 
the training set, student, school, and district level features 
(excluding TerraNova test score) were used to predict student 
assignment to one of the four clusters developed in training. 
Following student assignment, clusters were interpreted to verify 
whether initial cluster labels generalized to this unseen data. 
Cluster characteristics varied for the test set, suggesting that 
cluster interpretations did not generalize. Graphics and additional 
information on cluster characteristics are available at 
tiny.cc/EDM2017Clustering for further reference. 


Cluster membership was then used to help predict TerraNova 
scores alongside student/system interaction features using cluster- 
specific stepwise linear regressions. Following the ensembling 
approach used in [7], separate regression models were built for 
each cluster before being ensembled to form a prediction model. 
Cluster models helped to depict the relative importance of 
student/system interaction features in the prediction of TerraNova 
scores for each value of K, as shown in Table 1. Variability in 
feature significance was observed across clusters. An alternative 
prediction model was constructed using the full dataset 
(essentially, K=1) in order to compare the accuracy of ensembled 
cluster models to an unclustered baseline. Table 1 presents 
unstandardized beta coefficients, standard errors, significance 
values, and overall model statistics across clusters and values of 
K, and reveals that cluster assignment was sometimes significant 
in predicting TerraNova scores. 


In terms of prediction model accuracy, Mean Absolute Error 
(MAE) and Root Mean Squared Error (RMSE) were both lowest 
when K=4 (23.27 and 30.32, respectively, compared to 25.88 and 
33.44 when K=1). Additionally, the difference between MAE and 
RMSE was lower when K=4 (7.05 compared to 7.56), suggesting 
that the variance in individual prediction errors decreases as K 
increases. Variance explained, as measured by R², was also higher 
when K=4, suggesting that the ensembled model was a stronger 
option than grouping all data together into a single cluster. 
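The two error measures compared above can be computed as in the following sketch; the score values are invented. For the reported figures, the gaps are 30.32 - 23.27 = 7.05 for K=4 and 33.44 - 25.88 = 7.56 for K=1. RMSE is never smaller than MAE, and the two are equal only when every absolute error has the same size, which is why a smaller gap points to less spread in the individual errors.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


# Invented test scores and predictions, for illustration only.
actual = [650, 700, 620, 680]
predicted = [640, 720, 620, 690]
gap = rmse(actual, predicted) - mae(actual, predicted)
```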


5. DISCUSSION 


Results of our clustering exploration revealed that there are 
distinct types of learners within ASSISTments that can be 
identified by using K-Means to cluster student, school, and district 
level characteristics and measures of student/system interaction. 
Results suggested that clusters contained identifiably different 
patterns of student behavior. However, applying these clusters to a 
test set revealed that cluster interpretations did not generalize well 
to held out data. The results of subsequent linear regression 
models suggested that if clustering could be reliably linked to 
student features, the approach could potentially be used to help 
drive personalization within the ASSISTments platform. 


Limitations of this work include being bound by the hierarchical 
nature of the data, assumptions inherent to K-Means analysis, and 
the potential for artificial inflation of model accuracy due to 
regression to the mean. As it stands, clustering does not 
necessarily fail as a method of personalization. Understanding the 
features that are important to each cluster, as well as the overall 
accuracy of ensembled cluster models and how such accuracy 
differs with varying values of K, could help to guide the design of 
learning interventions specific to particular students. However, the 
reliability of the approach may be extremely sensitive to the 
quantity and quality of available data, making clustering a 
difficult approach for personalized learning. 


ABSTRACT 


It is reported by different universities that over 40% of stu- 
dents do not complete their studies within 6 years. Espe- 
cially in technical courses, the drop-out rate is already very 
high at the beginning. Therefore an automatic drop-out pre- 
diction is useful for a monitoring system. Since the study 
progress data can be sorted by time, we show how they can 
be transformed into a multivariate time series. Then we 
examine the dynamic time warping (DTW) distance in con- 
junction with the k-nn classifier and show how DTW can be 
used as an SVM kernel for drop-out prediction on the time- 
series data. With this approach, we are able to recognize 
about 67% of the drop outs from the course of study after 
the first semester and about 60% after the second semester. 


1. INTRODUCTION 


The number of drop outs is a major problem for many univer- 
sities. Over 40% of students do not complete their studies 
within 6 years [1]. Especially in technical courses, the num- 
ber of drop outs in the first semesters is high. For example, 
[5] reports that in the Electrical Engineering course the 
drop-out rate of beginners is about 40%. Human monitor- 
ing is used to solve this problem [5]. With a large num- 
ber of students, this can lead to a huge manual effort, so 
that a machine-made pre-selection could facilitate the work 
of a human decision-maker. Most students fail in the first 
semesters, which requires an early prediction. The quality of 
the available data is very important for automatic drop-out 
prediction. However, due to data protection laws, often lit- 
tle data are available for use. The data is often restricted to 
only a small amount of private data and the study progress 
data, so that only examinations, their corresponding grades, 
and the number of attempts per semester are given. Because 
of the dearth of data, it is important to obtain as much se- 
mantics as possible from the data, such as temporal aspects. 
The study progress data can be viewed as a multivariate time 
series. In this paper, we will investigate methods that can 
perform drop-out predictions on time-series data. 


2. RELATED WORK 

Many studies have been published on student drop-out pre- 
diction like [1], [5], [6]. The data mining methods used in- 
clude SVM, decision trees, k-nn, and neural networks. Stud- 
ies were also made in the field of time series analysis. In [6] 
the authors investigated time series clustering to identify at- 
risk online students. Several studies, for example [7], have 
used DTW for time series clustering to identify distinct ac- 
tivity patterns among students. The results of the individual 
publications are difficult to compare with each other because 
the data used and the goals are very different. While some 
seek to prevent drop outs from a study subject, others seek 
to prevent drop outs from the whole study. We also use 
DTW, not for clustering but as a distance for classifiers. 


3. METHOD 


If only the data of the study progress are available per 
semester, as much semantic information as possible must be 
collected from the data. Assume that the study progress of a 
student S consists of n semesters $s = \{sem_1, \dots, sem_n\}$. 
A function $\Phi : s \to T_{n,m}$ that transforms the ordered 
set s into a multivariate time series is then needed: 

$$\Phi(s) = \Phi(\{sem_1, \dots, sem_n\}) = \{\phi(sem_1)', \dots, \phi(sem_n)'\} = \{s_k\}_{1 \le k \le n} = T_{n,m}, \qquad s_k = [s_{1,k}, \dots, s_{m,k}] \in \mathbb{R}^m$$


In each semester the students have the possibility to take q 
courses. The results of each course can be expressed by a 
number p of properties such as the final score or the number 
of trials. All information of a semester can thus be repre- 
sented in a vector of size m = q x p. If, for example, the 3 
properties were: 1) achieved grade (numeric), 2) passed (bi- 
nary), and 3) number of attempts (numeric), and in a certain 
semester, a student had taken the first and last course from 
the list of all possible courses then the resulting vector for a 
semester could look like the one shown below. 


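$$\phi(sem_i) = [\, \underbrace{5 \;\; 1 \;\; 1}_{\text{course } 1} \;\; 0 \;\; 0 \;\; 0 \;\; \underbrace{1 \;\; 0 \;\; 2}_{\text{course } q} \,]$$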
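The encoding of one semester into a vector of size m = q x p can be sketched as below. The course identifiers, the dictionary layout, and the function name are assumptions made for illustration; courses not taken in the semester are filled with zeros, as in the example vector.

```python
def encode_semester(results, course_ids, n_props=3):
    """Flatten one semester into a vector of length q * p.

    results: dict mapping a course id to its p properties, here
             (grade, passed, attempts); only courses taken appear.
    course_ids: the fixed, ordered list of all q possible courses.
    """
    vec = []
    for cid in course_ids:
        # Courses not taken this semester contribute p zeros.
        vec.extend(results.get(cid, (0,) * n_props))
    return vec


# Hypothetical course list with q = 3; the student took the first and last.
course_ids = ["C1", "C2", "C3"]
semester = {"C1": (5, 1, 1), "C3": (1, 0, 2)}
vector = encode_semester(semester, course_ids)

# A student is then the ordered list of per-semester vectors.
student_series = [vector, encode_semester({"C2": (2, 1, 1)}, course_ids)]
```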


Thus, we can represent a student as a temporal sequence of 
his completed semesters. To compare two such sequences, 
we need a distance for multivariate time series. A well- 
researched distance for time series is the $d_{DTW}$ distance. 


Dynamic Time Warping (DTW) [3] is an algorithm from the 
domain of time series. It is generally defined for univariate 
time series and can be used to calculate a distance between two 
time series $a = (a_1, \dots, a_n), a_i \in \mathbb{R}$ and $b = (b_1, \dots, b_m), b_j \in \mathbb{R}$ 
of different lengths. To extend the DTW distance to mul- 
tivariate time series, various methods have been proposed in 
the literature, such as DTW_D [8]. DTW_D is calculated just as in 
the one-dimensional case, except that the pairwise distance 
$d(a_i, b_j)$ is calculated with the Euclidean distance. 
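A minimal sketch of this multivariate DTW distance is shown below. It assumes the common unconstrained dynamic-programming formulation with the three standard step directions and sums the raw Euclidean point distances along the warping path; other variants (squared distances, window constraints) exist and are not covered here.

```python
import math


def dtw_d(a, b):
    """DTW distance between two multivariate series.

    a and b are sequences of equal-width vectors; they may differ in length.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: cheapest alignment of the first i points of a
    # with the first j points of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance across all dimensions at once.
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance in a only
                                 cost[i][j - 1],      # advance in b only
                                 cost[i - 1][j - 1])  # advance in both
    return cost[n][m]


# Two toy series of different length.
series_a = [(0, 0), (1, 1), (2, 2)]
series_b = [(0, 0), (2, 2)]
distance = dtw_d(series_a, series_b)
```

The table has (n + 1) x (m + 1) entries, so the cost is quadratic in the number of semesters, which is small for study progress data.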


The drop-out prediction is a binary problem. One of the 
most popular binary classifiers is the support vector machine 
(SVM) [4] because it can separate linearly separable sets opti- 
mally from each other. If the training dataset is not linearly 
separable, a kernel trick is used to solve the problem. An 
often used kernel is the Gaussian kernel. In [2], an adapta- 
tion of the Gaussian kernel to the Gaussian DTW (GDTW) 
kernel was made for sequential data. The GDTW kernel 
$K_{GDTW}$ can be defined by $K_{GDTW}(x, y) = e^{-\gamma \, d_{DTW}(x, y)}$. 
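Given pairwise DTW distances, the kernel matrix passed to an SVM can be sketched as follows. The sketch assumes the exponential form exp(-gamma * d) with a free width parameter `gamma`, and the function name and distance values are illustrative. Because DTW does not satisfy the triangle inequality, a matrix built this way is not guaranteed to be positive semi-definite.

```python
import math


def gdtw_gram(distances, gamma=1.0):
    """Turn a matrix of pairwise DTW distances into a GDTW kernel matrix."""
    return [[math.exp(-gamma * d) for d in row] for row in distances]


# Invented pairwise distances between three students' time series.
distances = [
    [0.0, 2.0, 4.0],
    [2.0, 0.0, 1.0],
    [4.0, 1.0, 0.0],
]
gram = gdtw_gram(distances, gamma=0.5)
```

Identical series get kernel value 1, and the value decays toward 0 as the DTW distance grows; `gamma` controls how fast.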


4. EVALUATION 

We have a data set with 704 students of which 310 did 
not successfully complete their studies within 10 semesters. 
For each student the following information per semester is 
available: idCourse, number of attempts, examination sta- 
tus (passed, failed), recognized exam (true, false), reached 
grade, and semester. We use recall and precision as eval- 
uation measures. The evaluation is performed 3 times for 
all parameters with a 10-fold cross-validation for the two 
approaches DTW_D-SVM and DTW_D-k-nn (an ordinary k-nn 
classifier that uses the DTW_D distance). We examine 
per semester how good the prediction is at the end of the 
semester. For example, the students who have studied at 
least 2 semesters are considered for the training and pre- 
diction of the drop out after the second semester. The 
length of the resulting multivariate time series vectors de- 
pends strongly on the number of courses used. Therefore, 
we will examine the influence of the number of courses used 
to create the vectors. In the dataset there are more than 
100 courses. Because most students of our dataset drop out 
after a few examinations, we sort all courses according to 
the number of students who have enrolled in them. Then 
the 5, 10 and 20 courses with the highest enrollment will be 
used for further study. After the first investigation, we have 
found that the k-nn parameter k = 11 is comparatively well 
suited and is therefore used for the evaluation. 
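A k-nn classifier with a pluggable distance can be sketched as below. The function name and toy data are illustrative, and a plain absolute difference on scalars stands in for the DTW distance so that the example stays short; any function of two series that returns a number can be passed as `distance`.

```python
from collections import Counter


def knn_predict(query, train, labels, distance, k=11):
    """Majority vote among the k training items closest to the query."""
    nearest = sorted(range(len(train)),
                     key=lambda i: distance(query, train[i]))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


# Toy one-dimensional data with a stand-in distance.
train = [1, 2, 3, 10, 11, 12]
labels = ["stay", "stay", "stay", "drop", "drop", "drop"]


def abs_diff(x, y):
    return abs(x - y)


prediction_low = knn_predict(2.5, train, labels, abs_diff, k=3)
prediction_high = knn_predict(10.5, train, labels, abs_diff, k=3)
```

With two classes, an odd k such as 11 avoids tied votes.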


We first consider the prediction after the first semester. The 
recall and precision results are shown in Figure 1. In the 
second semester, 609 students are still active, of whom 215 
will be leaving in the future. After the last examination of 
the second semester, almost 60% of these students can be 
recognized with 11-nn. The precision of 11-nn is also about 
60%. DTW_D-SVM achieves a 10% higher precision when 
using more than 10 courses to create the multivariate time 
series vectors. However, the recall value of DTW_D-SVM is 
significantly smaller. At the end of the 3rd semester, the 
limits of DTW_D-SVM are recognizable. Both the recall and 
the precision values are smaller for 5 courses, and decrease to 
0 for more used courses. 11-nn remains stable and provides 
similar results as after the second semester. In the third 
semester, 542 students are still active, of whom 143 will be 
leaving. 11-nn can recognize about 84 of these 143 students. 
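As a worked check, recognizing 84 of the 143 leaving students corresponds to a recall of 84/143, or about 0.59. The sketch below shows how both evaluation measures follow from true and predicted labels; the label vectors are invented.

```python
def recall_precision(actual, predicted):
    """Recall and precision for the positive class (1 = drop out)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Invented labels: three drop outs, two of them recognized, one false alarm.
actual = [1, 1, 1, 0, 0]
predicted = [1, 1, 0, 1, 0]
recall, precision = recall_precision(actual, predicted)

# Recall implied by the third-semester figures in the text.
third_semester_recall = 84 / 143
```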


5. CONCLUSION 


We have shown how a study progress can be transformed 
into a multivariate time series. Then we demonstrated that 
the DTW_D distance can be used within an SVM kernel 
to make an SVM usable for student time series data. We 
compared the DTW_D-SVM with the 11-nn classifier, which 
also uses the DTW_D distance, on a dataset with 704 stu- 
dents and found that the k-nn classifier is better suited to 
achieve higher recall values in the drop-out prediction. The 
DTW_D-SVM is only suitable until the second semester and 
provides better precision results. In the later semesters, the 
values become worse because most of the students in the first 
semester fail because of a few specific courses. For the stu- 
dents from the technical courses, it is usually the first math- 
ematics courses. In the later semesters, the reasons cannot 
be stated so easily. Generally, this approach is only for the 
prediction and not to determine the reasons. In future work, 
we also want to determine the reasons for drop outs. 


Figure 1: Recall and Precision results 


ABSTRACT 


Interactive simulations can help students make sense of com- 
plex phenomena in which multiple variables are at play. 
To succeed, these simulations benefit from scaffolds that 
guide students to keep track of their investigations and reach 
meaningful insights. In this research, we designed an inter- 
active simulation of a solar oven design and explored how 
students utilized the simulation during learning and how 
scaffolds functioned to alter the learning experience. We 
used a table for recording trials and guiding questions to 
scaffold students’ interactions with the simulation. We em- 
ployed data mining techniques to analyze student interac- 
tions for use of the control of variables strategy and other 
approaches. We found that the control of variables strat- 
egy may not be as beneficial for learning as an exploratory 
strategy. 


Keywords 


Interactive Simulations, Science Education, Inquiry, Log Data 


1. INTRODUCTION 


Simulations can be powerful tools for allowing students to 
engage in inquiry, especially in science disciplines. To suc- 
ceed, these simulations generally benefit from scaffolds that 
guide students to keep track of their investigations and reach 
meaningful insights [6]. In this study, we examine guiding 
questions and recording of trials in a table as scaffolds. We 
use a simulation of a solar oven that allows students to inves- 
tigate the multiple variables at play in energy transformation 
and gives representation to invisible phenomena. 


We used the knowledge integration framework to create the 
curriculum about solar ovens, because the framework fo- 
cuses on building coherent understanding [4]. This frame- 
work offers instructional design principles to enhance con- 
nections between design decisions and scientific principles. 
The knowledge integration framework has proven useful for 
design of instruction featuring dynamic visualizations [8] and 
engineering design [1, 6]. 


Various scaffolding methods are often used with interactive 
simulations. Often, these scaffolds are implicit, or built into 
the system with the simulation [7]. For example, guiding 
questions are used with inquiry simulations to direct stu- 
dents’ attention toward certain features of simulations [2]. 
Other tools, like concept maps and note-taking spaces, can 
also assist students in making sense of inquiry simulations 
[3]. 


Using log files from student interactions with the curriculum 
and output from the automatically generated tables (simu- 
lation scaffolding), we use feature engineering to identify 
how students use the model and whether these uses have an 
impact on learning. 


2. CURRICULUM 


This research focuses on a curriculum about solar ovens that 
is run using the Web-based Inquiry Science Environment 
(WISE). During this curriculum, students design, build, and 
test a solar oven. Students use an interactive computer sim- 
ulation to test the different materials in their oven during 
the design process. 


This curriculum takes between 10 and 15 hours, and students 
complete the project in groups of 2 or 3. Students also 
complete individual pretests and posttests. 


2.1 Interactive Computer Simulation 

The scaffolds we developed for the interactive simulation 
are twofold: short response style questions direct students 
to investigate capabilities and limitations of the simulation 
and an automatically generated table helps students to keep 
track of trials they have run. The table includes information 
about all of the settings used in that trial, as well as the 
results of the trial at certain time points. 


3. DATA 


This data comes from 635 students across three schools and 
five teachers. These students formed 255 teams. After drop- 
ping students who did not complete significant portions of 
the curriculum, there were 558 students and 246 groups or 
partial groups remaining. 


4. DESCRIPTIVE STATISTICS 


Of the 246 groups who participated in the curriculum, 216 
(87.80%) used the computer model to pro- 
duce at least one row of data during the first design iteration. 
We consider each row of data produced to be a trial. As seen 
in Figure 2, many groups do not use the simulation scaffolds 
at all and produce zero rows in the automatically generated 
table. Still more students produce only 1 row in the table, 
which may mean they are confirming their ideas for a solar 
oven that they have already discussed and planned prior to 
using the simulation and without any evidence outside of 
their intuitions. 


Figure 1: The interactive simulation used by stu- 
dents to test solar ovens and visualize energy trans- 
formation; below the simulation is output from 
the automatically generated table 


Figure 2: Histogram depicting the frequency of the 
number of trials run by a group of students dur- 
ing the first iteration of using the simulation (Mean: 
2.27). x-axis: Number of Rows Generated in Iteration 1, 
by Student Group; y-axis: Frequency. 


5. CONTROLLING VARIABLES 


We define a control of variables strategy as changing a single 
variable at a time. We use feature engineering to develop a 
variable, COV Trials, that represents the number of trials a 
student ran using the control of variables strategy. Overall, 
137 (55.69%) of the 246 groups employed a control of vari- 
ables strategy. There were 216 groups that used the table 
scaffolds to generate at least one row of data. Of the groups 
that generated at least two rows in the table (115), 103 of 
them (89.56%) employed a control of variables strategy. 
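One way such a feature could be engineered from the table rows is sketched below. It assumes that a trial counts toward the strategy when it differs from the immediately preceding trial in exactly one setting; the source does not state the comparison rule, and the setting names and values are hypothetical.

```python
def count_cov_trials(trials):
    """Count trials that change exactly one setting versus the previous trial.

    trials: ordered list of dicts, one per table row, mapping a setting
            name to its value. Assumes every row records the same settings.
    """
    count = 0
    for prev, cur in zip(trials, trials[1:]):
        changed = sum(1 for key in cur if cur[key] != prev.get(key))
        if changed == 1:
            count += 1
    return count


# Hypothetical rows from one group's automatically generated table.
trials = [
    {"cover": "none", "lining": "foil", "shape": "wide"},
    {"cover": "plastic", "lining": "foil", "shape": "wide"},        # 1 change
    {"cover": "glass", "lining": "black paper", "shape": "wide"},   # 2 changes
    {"cover": "glass", "lining": "black paper", "shape": "narrow"}, # 1 change
]
cov_trials = count_cov_trials(trials)
employed_cov = cov_trials > 0
```

Under this rule a group with a single row can never be counted as using the strategy, which is consistent with reporting the rate among groups that generated at least two rows.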


6. EFFECT ON LEARNING 


Using pretest and posttest scores we aimed to understand 
the effect of actions with the simulation on learning. We 
found that the number of rows generated during the simula- 
tion was a significant predictor of learning (b = 0.10, t(546) 
= 2.68, p < 0.01). However, simply employing a control of 
variables strategy was not a significant predictor of learning. 
There were also two short response scaffolding questions. We 
generated a variable based on the number of questions stu- 
dents answered (0, 1, or 2). This was predictive of learning 
(b = 0.10, t(546) = 2.56, p = 0.011). 


Overall, evidence suggests that, to improve learning out- 
comes, students should be encouraged to experiment with 
the model, guided to produce at least two rows of data in 
the table, and prompted to use the short response questions. 
Perhaps changing more than one variable at a time in this 
type of environment indicates that students are spending 
more time thinking about possible outcomes. 


7. LIMITATIONS 


While we have found simulations to be beneficial for stu- 
dent learning in previous work [5], it is important to note 
that not all student learning is due to interactions with the 
simulation. While there is likely some difference between 
students who generated one row versus those who generated 
two or more rows, it is difficult to understand the differences 
between using a control of variables strategy and generating 
multiple rows of data in the table. 


ABSTRACT 


We introduce a Markov chain based model that quantifies 
university dormitory occupancy as a function of parameters related 
to university housing policies, students’ success and academic 
progress, and customer satisfaction/dorm availability. The model 
quantifies the sensitivity of university housing occupancy to changes 
in the parameters. We demonstrate the functionality of the model on 
several case scenarios from a public university. 


Keywords 


Modeling, dormitory occupancy, university housing, Markov 
chains, sensitivity, students’ success, Banner. 


1. INTRODUCTION 


In this study, we introduce a housing occupancy model based on 
Markov chains [e.g., 1]. The model determines the relationship 
between the number of students in dormitories, the number of students 
in the incoming class, and probabilities quantifying students’ retention, 
advancement between ranks (freshmen, sophomores, etc.), 
customer satisfaction and availability of housing. The model 
provides an opportunity for what-if analysis and assessment of 
change in housing occupancy due to variation of model parameters. 
The values of model parameters are learned from a transactional 
database. 


We provide a case study based on three years of data from Delaware 
State University, a public comprehensive historically black 
college/university in Delaware, and demonstrate the quantitative 
change in housing occupancy as a result of possible changes in 
housing policy, housing demand and retention. The proposed 
technique is applicable to universities offering predominantly 
undergraduate programs and can be easily adapted for universities 
with substantial graduate programs and participation of 
international students. 


2. METHODOLOGY 
2.1 Problem 


We consider a university offering undergraduate programs. The 
students at the university may be of in-state or out-of-state domicile 
(in-state students are the students whose residence is in the same 
state as the university). During the course of study, out-of-state 
students may convert to in-state or vice versa. A new student at the 
university can be enrolled as a new freshman (NF) or a new transfer 
(NT). For a student retained at the university, a rank depends on the 
cumulative number of credits (earned at the university + 
transferred). The ranks satisfy a partial order. Thus, an NF or NT, if 
retained, may continue as a returning freshman (RF), sophomore 
(SO), junior (JR) or senior (SR). Retained RF may continue as RF 
or progress into SO, JR or SR. Retained SO may continue as SO, 
or progress as JR or SR. Each student in a particular year can be a 
dorm resident. If retained, a student may change dorm residency 
status, i.e., a dorm non-resident may become a dorm resident or vice 
versa. 


Our goal is to determine the relationship between various 
parameters characterizing students’ population and academic 
progress and the total number of dorm residents in a particular year. 


2.2 Markov Chain Model 


We model the considered problem with a time-homogeneous
Markov chain [1]. A student at the university can be described by a
state $s_{(i,j,k)}$, determined by an ordered triple of indices i, j, and k
indicating domicile, rank and dorm residence:
$i \in \{\text{InState}, \text{OutOfState}\}$,
$j \in \{\text{NF}, \text{NT}, \text{RF}, \text{SO}, \text{JR}, \text{SR}\}$ and
$k \in \{\text{DormResident}, \text{NotDormResident}\}$. The starting states
correspond to $i \in \{\text{InState}, \text{OutOfState}\}$,
$j \in \{\text{NF}, \text{NT}\}$,
$k \in \{\text{DormResident}, \text{NotDormResident}\}$. The total number of
non-absorbing states is 24. In addition, a student can graduate or leave
the university, corresponding to an absorbing state, denoted with $s_a$.
The transition between states $s_{(i,j,k)}$ and $s_{(i',j',k')}$ is uniquely
determined by a transition probability that, under the assumption of
time homogeneity, is denoted by $p_{(i,j,k),(i',j',k')}$. In addition, the
model includes transition probabilities $p_{(i,j,k),a}$ from states
$s_{(i,j,k)}$ to the absorbing state.
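
The state space described above can be enumerated in a few lines. The following is a minimal sketch, not the authors' implementation; the variable names are illustrative assumptions, while the index values are taken from the text.

```python
from itertools import product

# Index sets as given in the text.
DOMICILES = ("InState", "OutOfState")
RANKS = ("NF", "NT", "RF", "SO", "JR", "SR")
DORM_STATUS = ("DormResident", "NotDormResident")

# Every non-absorbing state is an ordered triple (i, j, k).
states = list(product(DOMICILES, RANKS, DORM_STATUS))

# Starting states are those of new freshmen (NF) and new transfers (NT).
start_states = [s for s in states if s[1] in ("NF", "NT")]

# A single absorbing state covers graduating or leaving the university.
ABSORBING = "a"

print(len(states), len(start_states))  # prints "24 8"
```

The count of 24 non-absorbing states matches the number stated in the text (2 domiciles x 6 ranks x 2 dorm statuses).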
2.3 Model Implementation
To operationalize the model, we introduce the following 
assumptions and simplifications: 


1) Students can transition only from out-of-state to in-state status; 


2) For in-state students who continue to stay in dorms, the transition
probability can be expressed as a product of the probabilities that a
student is retained, that a student advanced from rank j to j' and the
probability that a student stayed in dorm;


3) For out-of-state students who continue to stay in dorms as
out-of-state, the transition probability is expressed as a product of
the probabilities that a student is retained, that a student does not
change out-of-state status, that a student advanced from rank j to j'
and the probability that a student stayed in dorm;


4) For out-of-state students who continue to stay in dorms as
in-state, the transition probability is expressed as a product of
the probabilities that a student is retained, that a student changes
out-of-state status to in-state, that a student advanced from rank j to j'
and the probability that a student stayed in dorm;


5) We compute probabilities that a dorm resident with domicile i'
and rank j' was a dorm resident in the previous year.
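
The product form in assumptions 2-4 can be sketched as below. This is only an illustration under the stated independence assumptions; the function name, argument names and the numeric values in the usage line are hypothetical and do not come from the paper.

```python
def transition_probability(p_retained, p_rank_advance, p_stay_in_dorm,
                           p_domicile=1.0):
    """Product-form transition probability for a student who stays in dorms.

    p_domicile is 1.0 for in-state students (assumption 2), the probability
    of remaining out-of-state (assumption 3), or the probability of
    converting from out-of-state to in-state (assumption 4).
    """
    factors = (p_retained, p_rank_advance, p_stay_in_dorm, p_domicile)
    for p in factors:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each factor must be a probability in [0, 1]")
    result = 1.0
    for p in factors:
        result *= p
    return result


# Hypothetical values: out-of-state student who stays out-of-state.
p = transition_probability(p_retained=0.8, p_rank_advance=0.5,
                           p_stay_in_dorm=0.7, p_domicile=0.9)
```

Each factor would be estimated separately from the transactional data, which is what makes the product form convenient for what-if analysis.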


2.4 Model Sensitivity 


After the parameter values are estimated, the sensitivity $s(\pi_l)$ of the
number of students in dorms to a particular parameter $\pi_l$ can be
determined as:

$$ s(\pi_l) = \frac{\Delta N^{D}}{\Delta \pi_l}, $$

where $\Delta N^{D}$ is the change in the number of students in dorms due to
a change $\Delta \pi_l = \pi_l^{new} - \pi_l$ of the parameter.


Subsequently, the influence of a change of particular model
parameters on the model output (the number of students in dorms)
can be linearized such that:

$$ \Delta N^{D} = \sum_l s(\pi_l)\, \Delta \pi_l. $$
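
The sensitivity and its linearized use can be sketched as follows. This is a minimal illustration; the function names, parameter names and numbers are hypothetical and are not taken from the case study.

```python
def sensitivity(delta_n, delta_pi):
    """Finite-difference sensitivity: change in dorm residents per unit
    change of one model parameter."""
    if delta_pi == 0:
        raise ValueError("parameter change must be non-zero")
    return delta_n / delta_pi


def linearized_change(sensitivities, deltas):
    """Approximate total change in dorm residents as the sum of
    sensitivity times parameter change over all changed parameters."""
    return sum(sensitivities[name] * deltas[name] for name in deltas)


# Hypothetical parameters and values, for illustration only.
s = {"retention": 200.0, "dorm_stay": -50.0}
change = linearized_change(s, {"retention": 0.1, "dorm_stay": 0.2})
```

The linearization is only expected to hold for small parameter changes; larger what-if scenarios would need the full model to be re-evaluated.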


3. RESULTS 
3.1 Data Set 


We estimated the model discussed in Section 2 on data from 
Delaware State University (DSU), a historically black
college/university (HBCU) located in Dover, DE, USA. DSU
utilizes Banner® Version 8 (Ellucian, Fairfax, VA, USA) as a
higher education enterprise resource planning (ERP) system. The
dataset contained a total of 13,709 records from years 2013/14-
2015/16. Each record had the values of attributes: StudentID, Year, 
Rank, DormResidence, Domicile. StudentID is a unique identifier 
of a student and together with Year comprise the primary key of the 
extracted table. 


3.2 What-if Analyses 


In this section we analyze realistic cases for changes of some of the 
model parameters and their influence on the change of number of 
students in dormitories. 


Case 1. Due to a policy change, all new freshmen and new transfers
are expected to stay in university housing regardless of whether they
are in-state or out-of-state. We obtain an increase in the
number of students in dormitories of $\Delta N^{D} = 467$.


Case 2. Due to the implementation of initiatives to address the needs of
incoming and returning freshmen, the retention rates of in-dorm new
and returning freshmen increase to 80%. This leads to an increase
of $\Delta N^{D} = 175$ students in dorms.


Case 3. Owing to improvement of dorm facilities, the demand for
dorm housing among upper rank students increases. This can, thus, be
considered a result of increased customer satisfaction. As a
consequence, this leads to an increase of $\Delta N^{D} = 83$.


4. DISCUSSION 


The proposed model makes it possible to account for retention that 
is frequently a key performance indicator related to university 
strategic plans and one of common quantitative measures of 
students’ success. Further, the model involves parameters related to 
academic progress of students. Also, we can indirectly model 
housing satisfaction and availability. The model makes it possible 
to consider in-state and out-of-state students separately, as the two 
groups of students that may have different demography, socio- 
economical conditions and academic success. Also, it is possible to 
evaluate the relationship between the size of the incoming class 
(new freshmen and transfers) and the housing occupancy. 


The model considers only two categories of students: in-state and 
out-of-state students. For universities with substantial numbers of 
international students, they can be added as an additional category 
and treated similarly as out-of-state students. The model assumes 
that in-state students cannot become out-of-state. However, the 
assumption can be relaxed by introducing a non-zero probability 
that in-state students of rank j become out-of-state. The assumptions
2—4 (probability independencies) may be contingent on university 
policies (distribution of students within dorms and on-campus 
housing allocation across student classes/ranks). Hence, they 
should be validated prior to the application of the proposed models 
at another institution of higher education. The current model 
assumes that the students who leave the university without
graduating do not return at a later date. In reality, some students
may leave the university temporarily and return ("stop-outs"). Note
that we utilized point estimates, hence the accuracy of parameter 
estimates (e.g., standard deviation) has not been addressed. Future 
work will include the development of interval estimates for model 
parameters as well as an application of validation techniques (e.g., 
leave-one-out cross-validation) to more strictly justify predictive 
ability of the model. 


5. CONCLUSION 


We proposed a Markov chain-based model of university housing 
occupancy and demonstrated it in a case study of a public 
university. We have shown that the proposed model can be useful 
in quantifying what-if scenarios related to changes in housing 
policy, retention and customer satisfaction. The model is developed 
for a university offering primarily undergraduate programs. It can 
be extended to graduate program offering institutions, with a 
challenge that graduate (especially PhD) programs are typically 
less structured (as evidenced in lack of ranks corresponding to 
sophomores, juniors, seniors in undergraduate programs). We 
demonstrated the use of a model with parameters estimated from 
data readily available on an industry-standard ERP system 
(Banner). As such, the model can be easily deployed at an 
institution of higher education that utilizes this or similar 
technology. 


6. ACKNOWLEDGMENTS 


This work has been supported through a grant from the Bill and 
Melinda Gates foundation. 


7. REFERENCE 
[1] Grinstead, C.M. 1997. Introduction to Probability, 2nd edn.
American Mathematical Society, Providence, RI. 


Proceedings of the 10th International Conference on Educational Data Mining 347 


Improving Models of Peer Grading in SPOC*


Yong Han, Wenjun Wu, Xuan Zhou 
State Key Laboratory of Software Development Environment, 
School of Computer Science, Beihang University, China 
{hanyong, wwj, zhouxuan}@nlsde.buaa.edu.cn 


ABSTRACT 


Peer-grading is commonly used to allow students to work as
graders to evaluate their peers' open-ended assignments in
MOOC courses. As a variant of MOOCs, SPOCs (Small Private
Online Courses) adopt the peer-grading method to grade
a number of student submissions. We propose a new
ability-aware peer-grading model for SPOC courses by introducing
the prior knowledge level of each student grader as their grading
ability in the process of calculating the grading score.


1. INTRODUCTION

A Small Private Online Course (SPOC) is a version of a MOOC
used locally with on-campus students. It often has a
relatively smaller number of students than a MOOC course.
SPOC students may come from the same classroom and
know each other. Previous research efforts on peer-grading
suggest that there is great disparity between the observed
scores presented by student graders and the true scores (the
instructor-given scores). Therefore, it is a major challenge
to correctly aggregate peer assessment results to
generate a fair score for every homework submission.

To solve the problem, we propose a group of new
peer-grading models by considering the student's mastery of
knowledge as a major factor for estimating final scores.
Throughout the paper, we refer to the mastery of knowledge as the
student's grading ability. Based on every student's
learning behavior and quiz-answering outcomes, we design a
two-stage individualized knowledge tracing model to accurately
assess their grading ability. Moreover, we introduce the new
peer-grading models by integrating every student's grading
ability into the factor of reliability. Experimental results in
our SPOC course verify the effectiveness of our new models.


2. RELATED WORK 


Many research efforts have been made to investigate the
factors that can affect grader bias and reliability.
Goldin et al. [1] used Bayesian models for peer grading
in the setting of traditional classrooms. They explored
major factors including grader bias and rubric biases in
their models. Walsh introduced a new algorithm named
PeerRank [4], based on the assumption that the ability of
student graders can be measured by the grades they received
in the process of peer grading. Our models are inspired by
the previous research work done in [8, 2]. We introduce the
grading ability of students into their models and develop an
individualized knowledge tracing model to estimate such
ability.


*The accompanying appendix is available at:
http://admire.nlsde.buaa.edu.cn/paper/2017-3.pdf


3. DATASETS 


The data sets in our experiments were collected in a SPOC
course named "The Experiment of Computer Network"
that is hosted on our MOOC platform. The course is
designed to teach both fourth-year CS undergraduates and
first-year graduate students basic knowledge and skills
in designing networking plans and configuring networking
devices at the multiple levels of link protocol, TCP/IP
protocol and network applications.

The course comprises 10 chapters, each of which has 8-14
problems as a homework assignment for students. The
course also includes two open-ended assignments in
graduate courses and three open-ended assignments in
undergraduate courses. Preliminary statistical analysis of the dataset
reveals that most peer-graded scores tend to be higher than
the instructor-given scores for the same submissions.


4. PROBABILISTIC MODELS OF PEER 
GRADING IN SPOC 


In this paper, we first establish a two-stage model to assess
each student's mastery level of each knowledge skill, which can be
used for estimating the graders' reliability. We then
present three probabilistic graphical models for peer grading by
extending the models PG4 and PG5 of [3].


4.1 Individualized Knowledge-Tracing model 
for Ability Estimation 


At the first stage, we extract interpretive quantities to
predict the probability that a student has mastered the
knowledge of a certain chapter; the logistic regression
method is used to fit these features and predict the
engagement level of every student [5]. At the second stage, our work
adopts the knowledge tracing model and improves it by
combining the prediction results obtained in the first stage.
The sequence of the exercises in each unit is modeled by an
HMM named PPS (the Prior Per Student model). We refer
to the results that the HMM generated as $a_v$, which denotes
the ability of graders prior to the peer-grading tasks. We
train the HMM by using $a_v$ as the initial element
of the sequence and then introduce it and the true score as
the parameters to model the reliability of a grader by a
Gamma or Gaussian distribution.


Figure 1: The relationship of the factors used in our models.

Our experiments show that our estimated ability is related
to the true score. Thus it is reasonable to use grader ability
to estimate the grader reliability.


4.2 Peer-Grading models 

We represent $a_v$ as the prior distribution for estimating every
grader's mastery of preparatory knowledge, $\tau_v$ as the
reliability of the student grader v, $b_v$ as the bias of the student
grader v, $s_u$ as the true score of a submission u, and $z_u^v$ as the
observed score for the submission.

Model PG6

$$ \tau_v \sim \mathcal{G}(a_v, \beta_0) $$
$$ b_v \sim \mathcal{N}(0, 1/\eta_0) $$
$$ s_u \sim \mathcal{N}(\mu_0, 1/\gamma_0) $$
$$ z_u^v \sim \mathcal{N}(s_u + b_v, 1/\tau_v) $$


We refer to our first model as PG6: the reliability variable
$\tau_v$ follows the Gamma distribution with $a_v$ as the shape
parameter instead of the true score used in PG4 in [2], and we utilize
the student's performance on multiple-choice exercises to
estimate his or her reliability in the process of peer-grading tasks.
Based on Model PG6, we introduce Model PG7 by
re-modeling the reliability variable $\tau_v$
($\tau_v \sim \mathcal{N}(a_v, \beta_0)$) with
the Gaussian distribution instead of the Gamma
distribution. The mean value of the Gaussian distribution in PG7 is
still $a_v$. We further extend Model PG7 by
adding the true score $s_v$ to $a_v$ to calculate the mean
of the reliability variable $\tau_v$
($\tau_v \sim \mathcal{N}(\theta_1 a_v + \theta_2 s_v, 1/\beta_0)$) and
introduce the parameter $\lambda$ to re-model the observed variable
$z_u^v$ ($z_u^v \sim \mathcal{N}(s_u + b_v, \lambda/\tau_v)$). This extended
model is named Model PG8.

In the above three models (PG6-PG8), we assume the
overall bias random variable $b_v$ follows the Gaussian distribution
with the mean value at zero. The true score $s_u$ follows the
Gaussian distribution with the mean value at $\mu_0$. Moreover,
the hyper-parameters $\beta_0$, $\eta_0$, $\mu_0$, $\gamma_0$, $\theta_1$,
$\theta_2$ and $\lambda$ are the priors.
For the observed scores $z_u^v$ in PG8, the parameter $\lambda$ is
similar to $\beta_0$ in PG6 and PG7, whose function is to scale
the variance of its Gaussian.
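
To make the generative reading of Model PG6 concrete, the sketch below draws one observed score from the four distributions. It is an illustration only, not the inference procedure from the appendix. It assumes that the second argument of the Gamma distribution is a rate (so the scale passed to Python is 1/beta0), and all default hyper-parameter values are hypothetical.

```python
import math
import random


def sample_pg6(a_v, beta0=1.0, eta0=1.0, mu0=70.0, gamma0=0.01, rng=None):
    """Draw (tau_v, b_v, s_u, z_uv) once from the PG6 generative model."""
    rng = rng or random.Random()
    # Reliability: Gamma with shape a_v (grader ability); beta0 assumed a rate.
    tau_v = rng.gammavariate(a_v, 1.0 / beta0)
    # Grader bias: zero-mean Gaussian with variance 1/eta0.
    b_v = rng.gauss(0.0, math.sqrt(1.0 / eta0))
    # True score of the submission: Gaussian with mean mu0, variance 1/gamma0.
    s_u = rng.gauss(mu0, math.sqrt(1.0 / gamma0))
    # Observed score: true score plus bias, with variance 1/tau_v.
    z_uv = rng.gauss(s_u + b_v, math.sqrt(1.0 / tau_v))
    return {"tau_v": tau_v, "b_v": b_v, "s_u": s_u, "z_uv": z_uv}


draw = sample_pg6(a_v=2.0, rng=random.Random(0))
```

Under this reading, a grader with higher ability $a_v$ tends to receive a larger reliability $\tau_v$, which shrinks the variance of the observed score around the true score plus bias.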


4.3 Inference and evaluation 
The details of the model inference procedures for PG6, PG7
and PG8 are described in the appendix. Our experiments
are all based on Gibbs sampling. At the beginning of the
Gibbs sampling process, the values of the parameters $\beta_0$,
$\eta_0$, $\mu_0$, $\gamma_0$ and $\lambda$ are initialized to empirical
values. We run each experiment for 400 iterations and discard the first
50 samples as burn-in.


5. EXPERIMENTAL RESULTS 

We compare our models PG6-PG8 with the baseline model
based on the simple median value, the models PG1-PG3
proposed in [3], and the models PG4-PG5 defined in [2].
The evaluation metric is the root-mean-square error (RMSE),
which is computed as the deviation between the estimated
score and the true score assigned by the course staff.
Compared to PG1-3 and PG4-5, our models PG6 and PG7
demonstrate the same level of RMSE in most cases. The
model PG8 shows a clearer improvement than PG6-7,
achieving the lowest RMSE. This confirms that PG8
demonstrates the best performance among all the models on
average. By combining the grader ability and the true score,
the model PG8 is the best approach among all the models
for estimating the peer-grading scores in SPOC courses.
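
The evaluation metric can be written out as a short function. This is a generic RMSE sketch; the function name and the example scores are illustrative and do not come from the paper's dataset.

```python
import math


def rmse(estimated, true):
    """Root-mean-square error between estimated and instructor-given scores."""
    if len(estimated) != len(true) or not true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimated, true)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical scores for two submissions.
error = rmse([80.0, 90.0], [80.0, 86.0])
```

A lower RMSE means the aggregated peer scores lie closer to the scores assigned by the course staff.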


6. CONCLUSIONS 


In this paper, we first introduce a two-stage individualized 
knowledge tracing model to estimate each grader’s level of 
knowledge mastery as their grading ability. And then, we 
propose three new probability graph models by introducing 
the grading ability as the major parameter for the latent 
variable of grader reliability. The experiments based on the 
dataset of our SPOC course demonstrate that our models 
can be effectively applied to aggregate the peer grades in 
SPOC courses. 
ABSTRACT


In this paper, we explore the problem of automatic grad- 
ing and feedback generation for open-response mathemat- 
ical questions. We resort to the long short-term memory 
(LSTM) network to learn the simple task of polynomial fac- 
torization and use the trained network for grading and feed- 
back. We use Wolfram Alpha to synthetically generate a 
training dataset that consists of step-by-step responses to 
polynomial factorization questions to train the LSTM net- 
work. Preliminary results validate the efficacy of LSTMs 
in learning to factor low-order polynomials; we also demon- 
strate how to leverage the trained network for automatic 
grading and personalized feedback generation. 


Keywords 
Automatic grading, Feedback generation, Long short-term 
memory networks, Mathematical expressions 


1. INTRODUCTION


In spite of tremendous advances in technology for
education, learning today largely remains a "one-size-fits-all"
approach. Personalized learning is the manifestation of
differentiation, the idea that all students access content and
develop mastery differently. The personalized learning
experience necessitates a scalable approach since the number of
students is much larger than the number of teachers. Many
recent advances focus on using machine learning algorithms
to analyze student data, but mostly resort to limited-utility
multiple-choice questions for grading and feedback [5].


The mathematical language processing (MLP) framework
proposed in [4] is the first automatic grading and feedback
generation tool for open-response mathematical questions.
MLP is capable of automatically grading a large number of
student responses with minimal human effort, but lacks
an effective feedback mechanism because it is not capable of
truly understanding mathematics, and is therefore unable
to provide informative feedback. A series of recent tools
based on recurrent neural networks (RNNs) [3] have found
great success in various NLP tasks (e.g., machine
translation, image captioning, etc.) and in predicting the output of
simple computer code [7]. Natural language processing for
the purposes of grading and feedback has also made
substantial progress in several restricted domains including
essay evaluation and mathematical proof verification [2, 6].
These successes inspire us to use RNNs to analyze responses
to mathematical questions due to their sequential,
step-by-step format and their algorithmic nature. They support our
belief that LSTMs have the ability to learn simple
mathematical operations such as factoring polynomials from data
and to provide relevant feedback.


1.1 Contributions 

In this paper, we apply the LSTM network [3], a type of 
RNN, to try to understand simple mathematics for auto- 
matic grading and feedback generation for open-response 
mathematical questions. In particular, we study the sim- 
ple problem of polynomial factorization due to the fact that 
responses to polynomial factorization questions are typically 
short and require only simple mathematical operations. We 
first generate a synthetic dataset using the Wolfram Al- 
pha API consisting of responses (step-by-step solutions with 
mathematical expressions and text explaining the mathe- 
matical operations performed) to polynomial factorization 
questions. We then train multiple LSTM networks on the 
dataset and evaluate their performance on factoring previ- 
ously unseen polynomials. Preliminary results show that 
the trained character level networks can factor previously 
unseen polynomials up to the second order with sufficient 
accuracy, after training on enough examples. More impor- 
tantly, we showcase how the trained networks have the po- 
tential for automatic grading and feedback generation for 
open-response mathematical questions. 


We emphasize that our proposed method has the
capability to go beyond Wolfram Alpha. First, the ability of the
trained LSTM networks to generalize to previously unseen
examples enables transfer between domains, i.e., these
networks have the capability of learning a rule in a certain
context and applying it in another context. This property enables
an LSTM network to build on its own knowledge as more
and more training data becomes available, which is a much
more scalable approach than the rules-based Wolfram Alpha
system, which requires new rules to be manually coded for
every new domain.


2. EXPERIMENTS 


           | Character Level % Error     | Expression Level % Error
# units    | 1 Layer  2 Layer  3 Layer   | 1 Layer  2 Layer  3 Layer
50         | 31.11    20.98    20.40     | 87.93    80.76    78.28
200        | 11.79    10.68    10.12     | 68.55    59.39    56.80
512        | 12.94     8.21    10.32     | 42.38    39.94    38.95

Table 1: Character and expression level misclassification
errors on the test set. The best models achieve 8.21%
(character level) and 38.95% (expression level).


Experimental setup. We generate factorable polynomials
that are subsequently used by the Wolfram Alpha API to
produce responses on how to fully factor these polynomials.
The responses include step-by-step solutions that consist of
a series of mathematical expressions that end in a
fully-factored final form, together with concise text describing
the mathematical operations involved. The data generation
process is limited to polynomials with a single variable,
coefficients that are less than 10 and up to the third order. We
construct a training dataset including 200,000 responses to
various factoring questions this way. A test dataset is
constructed with 20 first, 20 second, and 20 third order
polynomials to be factored. We emphasize that, while for the
simple task of polynomial factorization Wolfram Alpha is
able to generate the correct response, our aim is to develop a
method that can generalize to more complicated
mathematical operations that are too complicated for a rules-based
system like Wolfram Alpha to cover. We train our LSTM
networks to operate on a character-by-character level, i.e.,
we use each character in a response as input and output data
at each time instant. We train 9 different LSTM networks
with varying numbers of hidden units ($N \in \{50, 200, 512\}$)
and layers (1, 2, and 3). We use 95% of the generated
training dataset for training and 5% as the validation dataset.
We train the LSTM networks for a total of 50-150 epochs or
terminate the training process early if the validation error
shows minimal change across 10 epochs. In order to achieve
faster training, we apply the curriculum learning approach
[1], i.e., we start by training the LSTM networks on
factorizations of first order polynomials until the validation error
cannot be further reduced, and then proceed to train on
responses factoring second order polynomials and beyond.
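
The first step of the setup, producing polynomials that are guaranteed to factor, can be sketched by expanding a product of linear factors. This is a hypothetical illustration, not the authors' generator: it assumes monic polynomials with integer roots and reads "coefficients less than 10" as an absolute-value bound, and it does not call the Wolfram Alpha API.

```python
import random


def expand_from_roots(roots):
    """Coefficients (highest degree first) of the product of (x - r)."""
    coeffs = [1]
    for r in roots:
        # Multiply the current polynomial by (x - r).
        coeffs = [a - r * b for a, b in zip(coeffs + [0], [0] + coeffs)]
    return coeffs


def random_factorable_polynomial(order, max_coeff=9, rng=None):
    """Sample integer roots until every coefficient is within the bound."""
    rng = rng or random.Random()
    while True:
        roots = [rng.randint(-max_coeff, max_coeff) for _ in range(order)]
        coeffs = expand_from_roots(roots)
        if all(abs(c) <= max_coeff for c in coeffs):
            return roots, coeffs


roots, coeffs = random_factorable_polynomial(3, rng=random.Random(1))
```

For example, `expand_from_roots([2, 3])` returns `[1, -5, 6]`, i.e. x^2 - 5x + 6 = (x - 2)(x - 3). Each generated polynomial would then be sent to the solver to obtain the step-by-step response used for training.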


Results and discussion. We evaluate the performance of 
our trained LSTM networks on factoring previously unseen 
polynomials using two metrics. The first metric computes 
the character-level misclassification error rate by comparing 
every character in the correct factorization to the maximum- 
likelihood predicted character by the trained LSTM net- 
work. The second metric computes the expression-level mis- 
classification error rate by comparing every full mathemat- 
ical expression in the correct factorization to the full pre- 
dicted expression by the trained LSTM network; a success- 
ful classification means that the entire expression is correctly 
predicted. 
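
The two metrics can be sketched as below. This is an illustrative version that assumes the predicted and correct sequences are aligned and of equal length; the function names and example strings are hypothetical.

```python
def character_error_rate(correct, predicted):
    """Fraction of character positions where the prediction is wrong."""
    if len(correct) != len(predicted) or not correct:
        raise ValueError("sequences must be non-empty and of equal length")
    wrong = sum(c != p for c, p in zip(correct, predicted))
    return wrong / len(correct)


def expression_error_rate(correct_exprs, predicted_exprs):
    """Fraction of expressions that are not predicted exactly."""
    if len(correct_exprs) != len(predicted_exprs) or not correct_exprs:
        raise ValueError("lists must be non-empty and of equal length")
    wrong = sum(c != p for c, p in zip(correct_exprs, predicted_exprs))
    return wrong / len(correct_exprs)


char_err = character_error_rate("(x+1)(x+2)", "(x+1)(x+3)")
expr_err = expression_error_rate(["(x+1)(x+2)", "x(x-3)"],
                                 ["(x+1)(x+3)", "x(x-3)"])
```

In this example a single wrong character gives a character-level error of 0.1 but makes one of the two expressions wrong, which illustrates why the expression-level error is the stricter metric.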


Experimental results for all 9 LSTM networks on both
metrics are shown in Table 1. In general, LSTM networks with
more hidden units and layers achieve lower
misclassification error rates. We note that the expression-level
misclassification rate is much higher (the best model achieves an
error rate of 38.95%) than the character-level
misclassification rate (the best model achieves an error rate of 8.21%).
This observation is not surprising since correctly predicting
the entire expression is much more difficult than successfully
predicting a character. Moreover, we observe that the best
model achieves error rates of 0% and 15%, respectively, on
factoring first and second order polynomials but a 100%
error rate on third order polynomials. This result is due to
the fact that factoring third order polynomials is hard since
it requires first factoring out a second order polynomial as
an intermediate step.
[Figure 1 panel labels: Student Response; Model Prediction]


Figure 1: Illustration of how to use a trained
LSTM network to detect when a student's response
deviates from the correct response.


Using trained LSTM networks for grading and feed- 
back. We now illustrate how the trained LSTM networks 
can be used for automatic grading and feedback generation. 
Figure 1 shows a typical use case with an actual student re- 
sponse and a direct comparison to the maximum-likelihood 
character the trained LSTM network predicts given the pre- 
vious characters as input. For automatic grading, we can 
calculate the predictive likelihood of every character in a 
student’s response using a trained LSTM network. We can 
then assign a grade to a response by its total predictive 
likelihood; since our LSTM networks are trained on correct 
responses, a correct response will have a higher predictive 
likelihood than an incorrect one. For personalized feedback 
generation, we can automatically alert a student that they 
might have made an error if the predictive likelihood of the 
next input character is lower than a certain threshold. In 
Figure 1, such an error is shown in red where the student re- 
sponse contains a character that the trained LSTM network 
predicts as highly unlikely. Using these predictive probabil- 
ities, we can also automatically provide hints to a student 
about the most likely next expression in case they get stuck. 
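
The grading and feedback idea can be sketched given the per-character probabilities that a trained network assigns to a student's response. This is a minimal illustration; the function name, the threshold value and the example probabilities are hypothetical, and obtaining the probabilities from an actual LSTM is outside the sketch.

```python
import math


def grade_response(char_probs, threshold=0.05):
    """Return the total log-likelihood of a response and the positions of
    characters whose predictive probability falls below the threshold."""
    if any(p <= 0.0 or p > 1.0 for p in char_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    log_likelihood = sum(math.log(p) for p in char_probs)
    flagged = [i for i, p in enumerate(char_probs) if p < threshold]
    return log_likelihood, flagged


# Hypothetical probabilities for a four-character response.
score, suspicious = grade_response([0.9, 0.8, 0.01, 0.95])
```

Here the third character would be flagged as a likely error, and the total log-likelihood could be mapped to a grade by comparing it with the values obtained for known correct responses.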
ABSTRACT


In recent years, most studies related to MOOCs have
focused on prediction and data analysis, while the
evaluation of learning performance still relies on the
experience of teachers. In particular, composing a proper
exam paper remains tedious work. In this paper, we use a
genetic algorithm to compose test papers with the support of
MOOC learning data, considering various constraints and
objectives. The experimental results based on a MOOC course
show that the mean absolute error of the prediction model is
roughly 12 points on a 100-point scale and that we can
successfully achieve the intelligent composition of test
papers with various objectives optimized.


Keywords 

MOOC (Massive Open Online Course); Machine Learning; 
Performance Prediction; Genetic Algorithm; Automatic Com- 
position of Test Paper 


1. INTRODUCTION 


In this paper, we focus on how to evaluate MOOC
learners' learning performance. The traditional written test's high
dependence on the teacher and neglect of the learners make
it ineffective in the MOOC learning environment. We
therefore provide a novel approach in which the final exam
papers can be automatically composed with the support
of MOOC learning data, considering various constraints and
objectives. In our approach, different machine learning
techniques are employed to construct a prediction model of
learning performance based on MOOC learning data. With the
prediction model of the learning performance, an intelligent
composition approach is proposed with various objectives
and constraints considered.


2. RELATED WORK


Since 2012, more and more researchers have started to study
MOOCs, such as [2, 1]. Common algorithms for automatically
generating test papers mainly include stochastic selection
with approximate matching [6], backtracking and genetic
algorithms [4, 5].


*This paper is supported by Online Education Fund of Quan
Tong Education (2016ZD304).


3. MODEL AND OVERALL FRAMEWORK 


3.1 Model 


Figure 1 shows the whole process of using MOOC learning
data to intelligently auto-generate a test paper. The input
is the MOOC learners' learning data, and the output is a test
paper. Here we use the scores of regular quizzes and homework
as learning data, and use the score of the final exam to represent
learning performance. The whole process is composed of two
important phases, performance prediction and test paper
composition. In the first phase, we use machine learning
techniques to train the performance prediction model. In
the second phase, we use a genetic algorithm to generate
the test paper.


3.2 Classified Performance Prediction Model
for Different Levels of Learners

Performance prediction is a very common regression
problem. However, if a single model is constructed for
all learners, the prediction results are often
unsatisfactory because of the complexity and diversity of learners.
Intuitively, students with different learning
levels will have different learning patterns [2]. Therefore, the
features that are useful and contribute to the prediction
results differ across levels of
learners. Hence, the performance prediction of massive learners
should be based on the level of learners, rather than treating
them as a whole. Different levels of learners should have
different prediction models.


3.3 Intelligent Composition of Test Papers Based
on Genetic Algorithm
Figure 1: The model framework of intelligent composition of test paper based on MOOC learning data


Table 1: Prediction Error of Machine Learning Algorithms

Model      | M5rules | SMOreg | LWR    | LR     | BP
Overall    | 21.103  | 21.423 | 21.657 | 21.132 | 34.006
Classified | 12.069  | 12.82  | 11.127 | 13.026 | 15.058


The goal of this section is to generate a test paper that meets
all constraints as far as possible. The constraints include
total score, difficulty, question types and knowledge points.
We need to format all constraints into an argument matrix as
the input of the composition of test papers [6]. For question
types and knowledge points, the constraints can be obtained by
multiplying the distribution matrix by the total score. For difficulty, most
statistical analyses show that a good test has a normal
distribution of scores, so we can generate it according to the
expected score E and variance σ. The expected score is
exactly our predicted result from the previous phase. The proportion
of a certain difficulty level can be derived from the
proportion of students in the corresponding score range. For instance,
the proportion of the "easy" level is equal to the proportion of
students with scores of 80-100 if there are a total of 5 levels. The
design of the genetic algorithm can be obtained from [4] and
[6].
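
The mapping from predicted scores to difficulty proportions can be sketched as below. Only the "easy" level corresponding to scores of 80-100 is stated in the text; extending the same rule to the remaining bands, the function name and the example scores are assumptions made for illustration.

```python
def difficulty_proportions(predicted_scores, levels=5, max_score=100):
    """Share of each difficulty level, from easiest (index 0) to hardest.

    The easiest level takes the proportion of students in the top score
    band (80-100 for 5 levels); lower bands map to harder levels.
    """
    if not predicted_scores:
        raise ValueError("at least one predicted score is required")
    width = max_score / levels
    counts = [0] * levels
    for score in predicted_scores:
        band = min(int(score // width), levels - 1)  # max_score joins top band
        counts[levels - 1 - band] += 1
    return [c / len(predicted_scores) for c in counts]


# Hypothetical predicted scores for five learners.
proportions = difficulty_proportions([85, 95, 70, 50, 10])
```

With these hypothetical scores, two of five learners fall in the 80-100 band, so 40% of the paper would be allotted to the "easy" level under this rule.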


4. EXPERIMENTAL RESULTS 
4.1 Data Description 


Our data comes from Combinatorial Mathematics, a math
class offered to graduate students majoring in computer science and
technology at Tsinghua University. It has been offered on both
edX and xuetangX. We can get a total of 35 features,
including 25 quiz scores, 8 homework scores and 1 final exam
score. The feature to be predicted is the final exam
score, since we use it to represent the learner's learning
performance.


4.2 Prediction Experiment and Results 

This experiment compares the classified prediction model with 
the overall prediction model. We adopt the machine learning 
algorithms used in [3]. In the classified model, we divide the 
learners into two groups according to their academic performance: 
those who passed the exam form one group and the rest form the 
other. The final prediction results are shown in Table 1. Note 
that we adopt mean absolute error as our prediction error, and 
all of the scores appearing in this paper are converted to 
percentile scores. From the results, we find that the classified 
model for different levels of learners substantially reduces the 
prediction error, by around 10 points. 
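
To make the comparison concrete, the sketch below computes the mean absolute error of an overall model and of a classified (per-group) model on made-up scores. It is not the paper's pipeline: a trivial mean predictor stands in for the algorithms of [3], and the pass mark of 60 and the toy scores are assumptions chosen only for illustration.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted scores."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_predictor(scores):
    """Stand-in model: predict the group mean for every learner."""
    mean = sum(scores) / len(scores)
    return [mean] * len(scores)

def overall_vs_classified(scores, pass_mark=60.0):
    """Return (overall MAE, classified MAE) for a list of final scores.

    pass_mark is a hypothetical threshold separating the two groups.
    """
    overall = mean_absolute_error(scores, mean_predictor(scores))
    passed = [s for s in scores if s >= pass_mark]
    failed = [s for s in scores if s < pass_mark]
    abs_error = 0.0
    for group in (passed, failed):
        if group:
            predictions = mean_predictor(group)
            abs_error += sum(abs(a - p) for a, p in zip(group, predictions))
    return overall, abs_error / len(scores)

# Toy scores, not data from the experiment.
overall_mae, classified_mae = overall_vs_classified([30, 40, 50, 70, 80, 90])
```

Fitting one model per group lowers the error here simply because each group is more homogeneous than the whole population, which is the intuition behind the classified model.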


4.3 The Composition of Test Paper Based on 
MOOC Learning Data 


This experiment is conducted to verify the performance of 
the composition algorithm. We first randomly select n testers 
from 17 testers. We then generate a test paper according to 
the average performance of all selected testers and use it to 
test them. From the experimental results shown in Table 2, we 
find that the predicted scores (performance) 
Table 2: Examination Results 

Number of testers | Predicted scores (performance) | Real exam scores 
17                | 77.59                          | 75.08 
16                | 79.75                          | Thor 
13                | 73.91                          | 69.23 
12                | 75.94                          | 61.14 
6                 | 82.48                          | 69.63 


are very close to their real exam scores, and the error tends to 
decrease as the number of testers increases, which indicates 
that our model is effective for evaluating the learning 
performance of a group of MOOC learners. 


5. CONCLUSION 


The general idea of this paper is to automatically generate 
personalized test papers guided by MOOC learners’ usual 
performance, so as to guide their further study. However, 
many details still need to be refined, such as the prediction 
accuracy and the efficiency of the composition algorithm. 
This work is therefore only a first step in integrating 
machine learning, MOOCs, and test development. Our future 
work will continue to focus on these details. 
ABSTRACT 


In this paper, we present and apply a procedure for evaluat- 
ing predictive models in MOOCs. First, we expand upon a 
procedure to statistically test hypotheses about model per- 
formance which goes beyond the state-of-the-practice in the 
community and covers the full scope of predictive model- 
building in MOOCs. Second, we apply this method to a 
series of algorithms and feature sets derived from a large 
and diverse sample of MOOCs (N = 31), concluding that 
several models built with simple clickstream-based feature 
extraction methods outperform those built from forum- and 
assignment-based feature extraction methods. 


1. INTRODUCTION AND RELATED WORK 


Building predictive models of student success has emerged 
as a core task in the fields of learning analytics and educational 
data mining.¹ The process of building such models 
in MOOCs involves at least three key stages: (1) extract- 
ing structured data and informative features from raw plat- 
form data (clickstream server logs, database tables, etc.); 
(2) selecting algorithms and models; and (3) tuning hyper- 
parameters. Together, these stages profoundly influence the 
performance of predictive models. We identify at least two 
methodological gaps in current educational data mining re- 
search as it relates to this task: (1) current research typically 
isolates these steps, e.g., evaluating different approaches to 
feature extraction or algorithm selection separately without 
considering their relation to each other; and (2) procedures 
for rigorous and reproducible statistical inference about the 
relative performance of these models, and accounting for the 
many model specifications considered in the course of an ex- 
periment, are often not followed. 


Previous predictive modeling research in MOOCs has evalu- 
ated features derived from clickstreams, discussion fora, as- 
signments, and surveys, among other sources. In addition, 
this research has applied a variety of algorithms to such data 
for dropout prediction, including linear and logistic regres- 
sion, support vector machines, tree-based methods, ensem- 
ble methods, neural networks, and deep learning. However, 
a literature survey by the authors indicated that accepted 
statistical practices for evaluating these models are often 
neglected by such research.² In particular, more than half of 


¹ The current work evaluates models of student dropout in 
MOOCs, but this methodology applies to any supervised 
predictive modeling task. 

² This survey reviewed the 2014-2016 International Society 
for Educational Data Mining (EDM) and the International 
surveyed research did not utilize any statistical testing for 
evaluating model performance, despite obtaining estimates 
directly on the training set through cross-validation for 
multiple models. These methods are susceptible to spurious 
results and low replicability due to multiple comparisons, 
biased performance estimates, and random variation from 
resampling schemes [3, 4, 7, 11]. Recent research has pro- 
vided evidence that some MOOC research may not be repli- 
cable when applied to new or different courses [1]; at the 
very least, this highlights the importance of adopting repro- 
ducible and statistically valid methods for model evaluation 
in MOOCs [8]. An extensive literature exists on statistically 
reliable methods for model evaluation [4, 6, 11]. 


2. METHODOLOGY 


We implement a testing and inference procedure from [3] for 
selecting the best of k > 2 models across N > 1 datasets (in 
this experiment, a model is a feature set-algorithm-hyperparameter 
combination), which consists of two steps. First, a Friedman 
test is used to test the null hypothesis that the performance 
of all models is equivalent [5]. The Friedman statistic 

$$\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_{j} R_j^2 - \frac{k(k+1)^2}{4} \right] \qquad (1)$$

where $R_j$ is the average rank of the jth of k algorithms on N 
datasets and the statistic is $\chi_{k-1}^2$ distributed, is compared to a 
critical value at the selected significance level ($\alpha = 0.05$ in this 
experiment). If $H_0$ is rejected, then we proceed to the second 
stage, the post-hoc Nemenyi test, where the critical difference 

$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}} \qquad (2)$$

is used to determine whether the performance between any 
two classifiers is significantly different, where $q_\alpha$ is based on 
the Studentized range statistic divided by $\sqrt{2}$. 
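
The two quantities above can be computed directly from a table of per-dataset scores. The following is a minimal sketch rather than the implementation used in this experiment: it assumes higher scores are better, assigns tied models the mean of their ranks, and hard-codes the q_α values for α = 0.05 as commonly tabulated for the Nemenyi test (these should be checked against a reference table before use). All function names are hypothetical.

```python
import math

# q_alpha for alpha = 0.05 (Studentized range statistic divided by sqrt(2)),
# keyed by the number of models k. Assumed from standard tables.
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728,
              6: 2.850, 7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}

def rank_row(row):
    """Rank one dataset's scores (1 = highest score); ties share the mean rank."""
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks

def average_ranks(scores):
    """scores[d][m] is the score of model m on dataset d."""
    n, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        for m, r in enumerate(rank_row(row)):
            totals[m] += r
    return [t / n for t in totals]

def friedman_statistic(scores):
    """Friedman chi-square statistic, Equation (1)."""
    n, k = len(scores), len(scores[0])
    r = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(x * x for x in r) - k * (k + 1) ** 2 / 4)

def nemenyi_cd(k, n, q_alpha=Q_ALPHA_05):
    """Critical difference in average rank, Equation (2)."""
    return q_alpha[k] * math.sqrt(k * (k + 1) / (6 * n))
```

Two models would then be declared significantly different when their average ranks differ by more than `nemenyi_cd(k, n)`; for example, `nemenyi_cd(8, 31)` gives the threshold for eight models compared over 31 datasets.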


This two-stage procedure allows us to conduct comparisons 
across multiple models and datasets to draw inferences about 


Learning Analytics and Knowledge (LAK) conference pro- 
ceedings, and included research which attempted to predict 
completion or performance using behavioral or academic fea- 
tures with features derived from MOOC platform data; a full 
survey is forthcoming in a future work. 
whether true performance differences exist, accounting for 
the number of comparisons k and datasets N. Unlike us- 
ing simple average cross-validated training performance, this 
procedure uses statistical testing to evaluate whether the ob- 
served difference is statistically significant or may be merely 
spurious, based on the available data. In applying this 
method to a feature set + algorithm + hyperparameter com- 
bination, we can (1) evaluate feature extraction as a testable 
modeling component; (2) capture and evaluate the synergy 
between feature extraction, algorithm, and hyperparame- 
ters; and (3) draw inferences which fully account for the 
number of comparisons across all of these elements.³ 


3. EXPERIMENT AND RESULTS 


As an illustrative example, we compare a series of models us- 
ing three feature sets and two predictive algorithms on a set 
of 31 offerings of 5 unique courses offered by the University 
of Michigan on Coursera, with 298,909 total learners. From 
the raw clickstream files and database tables, we extracted a 
series of features intended to replicate (with some additions) 
features shown to be effective dropout predictors, with each 
utilizing information from a different raw data source: 
clickstream [10], assignment [9], and forum features [1]. 


We train two classifiers — standard classification trees and 
adaptive boosted trees — on various combinations of the 
three feature sets, performing no hyperparameter tuning (to 
limit the number of comparisons, k). Figure 1 presents the 
results of our analysis. 


Results from dropout prediction after course week 2 are 
shown in Figure 1, but our findings were consistent across 
all four weeks examined. We find that models utilizing click- 
stream features consistently outperform those using forum 
and quiz features. This difference was statistically signifi- 
cant for all model configurations tested. Changing the clas- 
sification algorithm had little effect on the performance of 
quiz- and forum-featured models, which were statistically 
indistinguishable from each other in every week evaluated. 
When the clickstream features are combined with forum and 
quiz features to form a “full” model, this model achieves 
better performance than the clickstream features alone, but 
this improvement is never statistically significant over the 
best clickstream-only model. This suggests that the forum 
and quiz features contain useful structure which may require 
powerful, flexible classification algorithms to capture. Our 
conclusion — that the highest-performing model is statisti- 
cally indistinguishable from other models in this analysis — 
stands in contrast to the practice of much of the prior re- 
search surveyed, which often concludes that the best average 
performance is the “best” model; this is intended to serve as 
an example for inferential language in future research. 


4. FUTURE RESEARCH 


Future research should utilize this or other methods for sta- 
tistically evaluating performance comparisons of predictive 
models. In particular, it should explore Bayesian methods 
for model evaluation, which allow the direct estimation of 


³ There are clear advantages to adopting this specific 
procedure over other testing approaches such as ANOVA, or other 
nonparametric approaches; see §3.2.1 of [3] for detailed dis- 
cussion of these benefits. 
[Figure 1 image: critical difference diagram with model labels Full ada, Full CART, Clickstream ada, Clickstream CART, Assignment ada, Assignment CART, Forum ada, Forum CART] 


Figure 1: Critical Difference (CD) diagram of week 
2 dropout prediction models. Models are plotted by 
average rank, with bold CD lines indicating statistically 
indistinguishable models (at α = 0.05). We 
reject H₀ of equivalent performance for models not 
connected by CD lines. These results show a sta- 
tistically significant performance gap between click- 
stream features and assignment or forum features. 


probabilities of hypotheses, avoid concerns about multiple 
comparisons, and have other additional advantages [2]. 
ABSTRACT 


In this paper, we propose a computational approach to modeling 
the Zone of Proximal Development of students who learn using a 
natural-language tutoring system for physics. We employ a 
student model to predict students’ performance based on their 
prior knowledge and activity when using a dialogue tutor to 
practice with conceptual, reflection questions about high-school 
level physics. Furthermore, we introduce the concept of the “Grey 
Area”, the area in which the student model cannot predict with 
acceptable accuracy whether a student has mastered the 
knowledge components or skills present in a particular step. 


Keywords 


Natural-language tutoring systems, intelligent tutoring systems, 
student modeling, zone of proximal development 


1. INTRODUCTION 


Intelligent Tutoring Systems (ITSs) support students in grasping 
concepts, applying them during problem-solving activities, 
addressing misconceptions and in general improving students’ 
proficiency in science, math and other areas [6]. ITS researchers 
have been studying the use of simulated tutorial dialogues that 
aim to engage students in reflective discussions about scientific 
concepts [4]. However, to a large extent, these systems lack the 
ability to gauge students’ level of mastery over the curriculum that 
the tutoring system was designed to support. This is also 
challenging for human tutors, who do gauge the level of 
knowledge and understanding of their tutees to some degree, 
although they are poor at diagnosing the causes of student errors 
[3]. We argue that in order to provide meaningful instruction and 
scaffolding to students, a tutoring system should appropriately 
adapt the learning material with respect to both content and 
presentation. A way to achieve this is to dynamically assess 
students’ knowledge state and needs. Human tutors use their 
assessment of student ability to adapt the level of discussion to the 
student’s “zone of proximal development” (ZPD)—that is, “the 
distance between the actual developmental level as determined by 
independent problem solving and the level of potential 
development as determined through problem solving under adult 
guidance or in collaboration with more capable peers” [7]. 
Deriving ways to identify and formally describe the ZPD is an 
important step towards understanding the mechanisms that drive 
learning and development, gaining insights about learners’ needs, 
and providing appropriate pedagogical interventions [2]. 
Following the practice of human tutors, we propose a 
computational approach to model the ZPD of students who carry 
out learning activities using a dialogue-based intelligent tutoring 
system. We employ a student model to assess students’ changing 
knowledge as they engage in a dialogue with the system. Based on 
the model’s predictions, we define the concept of the “Grey 
Area”, a probabilistic region in which the model’s predictive 
accuracy is low. We argue that this region can be used to indicate 
whether a student is in the ZPD. Our research hypothesis is that 
we can use the outcome of the student model (i.e., the fitted 
probabilities that predict students’ performance) to model 
students’ ZPD. To the best of our knowledge, this is a novel 
approach to modeling the ZPD. Even though we focus on 
dialogue-based tutoring systems, we expect that our approach can 
be generalized and extended to other kinds of ITSs. 


2. METHODOLOGY 


In this study, we used data collected during three previous studies 
with the Rimac system to train a student model and frame the 
proposed approach. Rimac is a web-based natural-language 
tutoring system that engages students in conceptual discussions 
after they solve quantitative physics problems [5]. Rimac’s 
dialogues present a directed line of reasoning (DLR) where 
knowledge components (KCs) relate to tutor question/student 
response pairings. To model students’ knowledge we used an 
Additive Factor Model (AFM) [1]. The model predicts the 
probability of a student completing a step correctly as a linear 
function of student parameters, knowledge components and 
learning parameters. AFM takes into account the frequency of 
prior practice and exposure to skills but not the correctness of 
responses. The dataset consists of training sessions of 291 
students over a period of 4 years (2011-2015). Students worked 
on physics problems that explore motion laws and address 88 
knowledge components (KCs). The dataset contains in total 
15,644 student responses that were classified as correct or 
incorrect using the AFM student model. 
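
As a reminder of how an Additive Factor Model produces its fitted probabilities, the sketch below uses the commonly cited AFM form: the log-odds of a correct step is the student parameter plus, for each KC involved in the step, a KC parameter and a learning rate multiplied by the number of prior practice opportunities. The function and parameter names are hypothetical, and the numbers in the example are not fitted values from the Rimac data.

```python
import math

def afm_probability(theta, kcs, beta, gamma, practice):
    """Probability of a correct step under an Additive Factor Model.

    theta    : student proficiency parameter
    kcs      : knowledge components involved in the step
    beta     : dict mapping KC -> easiness parameter
    gamma    : dict mapping KC -> learning-rate parameter
    practice : dict mapping KC -> number of prior practice opportunities
               (correctness of earlier responses is not used)
    """
    logit = theta + sum(beta[k] + gamma[k] * practice.get(k, 0) for k in kcs)
    return 1.0 / (1.0 + math.exp(-logit))

# Illustrative values only.
p_first_attempt = afm_probability(0.0, ["kc1"], {"kc1": 0.0}, {"kc1": 0.1}, {})
p_after_practice = afm_probability(0.0, ["kc1"], {"kc1": 0.0}, {"kc1": 0.1}, {"kc1": 10})
```

Because only the practice count enters the model, two students with the same number of opportunities on a KC receive the same learning contribution regardless of how many of those attempts were correct.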


Our research hypothesis is that we can use the fitted probabilities, 
as predicted by the student model, to model the ZPD. The core 
rationale is that if the student model cannot predict with high 
accuracy whether a student will answer a tutor’s question 
correctly, then it might be the case that the student is in the ZPD. 
The student model provides predictions at the step level: each step 
consists of one question/answer exchange from the tutorial 
dialogue. A step may involve one or more KCs. The classification 
threshold (i.e., the cutoff determining whether a response is 
classified as correct or incorrect) is 0.5 and it was validated by the 
ROC curve for the binary classifier. We expect that the closer the 
prediction is to the classification threshold, the higher the 
uncertainty of the model and thus, the higher the prediction error. 
Based on our hypothesis, this window of uncertainty can be used 
to approximately model the student’s zone of proximal 
development. We refer to this window as the “Grey Area”. 
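
A minimal sketch of how a fitted probability could be mapped to the three regions is given below. The half-width of 0.1 is purely an assumption for illustration, since we do not propose a specific size for the Grey Area, and the function name and labels are hypothetical.

```python
def grey_area_label(probability, threshold=0.5, half_width=0.1):
    """Place a fitted probability relative to a symmetric Grey Area.

    half_width is a hypothetical size; the appropriate value is left open.
    """
    if probability > threshold + half_width:
        return "above"   # predicted correct with some confidence
    if probability < threshold - half_width:
        return "below"   # predicted incorrect with some confidence
    return "grey"        # near the threshold: candidate for the ZPD

labels = [grey_area_label(p) for p in (0.9, 0.55, 0.2)]
```

An asymmetric Grey Area could be expressed in the same way by using separate lower and upper widths around the threshold.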
Figure 1. The Grey Area concept with respect to the fitted 
probabilities as predicted by the student model for a random 
student and for the various steps of a learning activity. Here 
we depict the example of a symmetrical Grey Area extending 
on both sides of the classification threshold. 


The concept of the Grey Area is depicted in Figure 1. The space 
“Above the Grey Area” denotes the area where the student is 
predicted to answer correctly and consequently may indicate the 
area above the ZPD; that is, the area in which the student is able 
to carry out a task without any assistance. Accordingly, the space 
“Below the Grey Area” denotes the area where the student is 
predicted to answer incorrectly and consequently may indicate the 
area below the ZPD; that is, the area in which the student is not 
able to carry out the task either with or without assistance. In this 
paper, we model the grey area symmetrically around the 
classification threshold for simplicity and because the binary 
classifier was set to 0.5. However, the symmetry of the Grey Area 
is something that could change depending on the classification 
threshold and the learning objectives. Furthermore, we do not 
propose a specific size for the Grey Area. We believe that the 
decision about the appropriate size (or shape) of the Grey Area is 
not only a modeling issue but mainly a pedagogical one since it 
relies on the importance of the concepts taught, the teaching 
strategy and the learning objectives. 


[Figure 2 chart: Model Behavior for Grey Areas of Different Size. Series: % cases, % predicted correctly, % predicted incorrectly. Horizontal axis: Area 1 to Area 5.] 

Figure 2. Model behavior (total number of predicted cases, 
cases predicted correctly and cases predicted incorrectly) 
within five grey areas of different sizes. The areas are ordered 
from the narrowest (Area 1) to the widest (Area 5). 


Figure 2 presents an analysis of the cases that are contained in the 
Grey Area. In this preliminary analysis, we examined five Grey 
Areas of different size. On one hand, choosing a narrow grey area 
to model the ZPD would limit the number of cases we scaffold, 
since fewer cases would fall within the area. On the other hand, 
choosing a wide grey area would affect the accuracy; that is, some 
cases that could be predicted correctly would be falsely labeled as 
“grey”. However, this work does not aim to define the appropriate 
size for the Grey Area but rather to study how the model’s 
behavior may change for areas of different size. 
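
The trade-off between coverage and accuracy can be examined with a small sweep over candidate sizes. The sketch below assumes a symmetric area around the 0.5 threshold and reports, for a given half-width, the share of cases falling inside the area and the share of those cases that the model predicts correctly. The function name, the half-widths, and the toy probabilities and outcomes are hypothetical and are not the data behind Figure 2.

```python
def grey_area_stats(probabilities, outcomes, half_width, threshold=0.5):
    """Return (share of cases inside the area, accuracy inside the area).

    probabilities : fitted probabilities of a correct response
    outcomes      : observed responses (1 = correct, 0 = incorrect)
    """
    inside = [(p, y) for p, y in zip(probabilities, outcomes)
              if abs(p - threshold) <= half_width]
    if not inside:
        return 0.0, None
    hits = sum(1 for p, y in inside if (1 if p >= threshold else 0) == y)
    return len(inside) / len(probabilities), hits / len(inside)

# Toy data for illustration only.
probs = [0.9, 0.55, 0.45, 0.2, 0.52]
observed = [1, 0, 0, 0, 1]
sweep = {w: grey_area_stats(probs, observed, w) for w in (0.03, 0.1, 0.5)}
```

Wider areas capture more cases but also absorb cases the model would have classified correctly, which is the effect discussed above.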


3. DISCUSSION 


In this paper, we present a computational approach that aims to 
model the Zone of Proximal Development in ITSs. To that end, 
we introduce the concept of the “Grey Area”. Our proposal is that 
if the model cannot predict the state of a student’s knowledge, it 
may be that the student is in the ZPD. We envision that the 
contribution of the proposed approach, besides its novelty (to the 
best of our knowledge there is no quantified operationalization of 
the ZPD) will be in defining and perhaps revising instructional 
methods to be implemented by ITSs. Choosing the “next step” is a 
prominent issue in the case of dialogue-based intelligent tutors. 
Not only should the task be appropriate with respect to the 
background knowledge of the student, but it should also be 
presented in an appropriate manner so that the student will not be 
overwhelmed and discouraged. To address this issue, we need an 
assessment of the knowledge state of each student and insight into 
the appropriate level of support the student needs to achieve the 
learning goals. This is described by the notion of ZPD. It is 
evident that if we can model the ZPD then we can adapt our 
instructional strategy accordingly. A limitation of our work is that 
we have not yet been able to conduct a rigorous evaluation of our 
approach; however, plans to validate our modeling methods are 
being developed. Our immediate plan is to carry out extensive 
studies to explore the proposed approach to modeling the ZPD 
further, as well as to better understand the strengths and 
limitations of using a student model to guide students through 
adaptive lines of reasoning. 
ABSTRACT 


Students experience considerable challenge in STEM coursework 
and many struggle to earn the grades needed to move forward in 
their majors. Interventions informed by prediction models can 
support learners to ensure successful completion of STEM courses 
and entry into the STEM workforce. In order to accurately target 
intervention efforts, we developed a prediction model based on log 
data generated by student use of content hosted on a learning 
management system (LMS; Blackboard Learn) course site in the 
first weeks of the course. The prediction model employed a forward 
selection logistic regression algorithm (with 10-fold cross 
validation) trained on four semesters of data, and provided 
instructors the opportunity to message students and provide 
learning support before the first major exam, potentially 
intervening before onset of poor performance. The best fitting 
model was used to identify students unlikely to obtain the required 
grade (B or better) in the course. Among 106 students predicted to 
perform poorly, 63 received a message from the instructor’s 
account that referenced an upcoming exam and linked students to 
supportive materials. Messaged students who accessed learning 
supports outperformed non-messaged but eligible students (n = 43) 
on each of five subsequent exams throughout the semester (ds = .64 
to .88). Fifty-eight percent earned a B or better, compared to 25% of 
non-messaged peers predicted to earn a C or worse. This study 
affirms that data-driven early alert messages can provide targeted 
support and boost achievement in challenging STEM courses. 


Keywords 


Learning management system, Prediction modeling, Early warning 
system, STEM learning, learning sciences 


1. INTRODUCTION 


Learning management systems (LMSs) have become a central tool in 
higher education. Logs of learning events can be combined with 
achievement data in order to identify (un)productive patterns of 
events and predict the achievement of future students based on their 
behavioral match to prior students who achieved certain levels of 
performance [1]. 


2. METHODS 


The university LMS, Blackboard Learn, captures and records 
student use of materials hosted on course sites. Student activity and 
achievement data (N=510) from 4 semesters of an undergraduate 
calculus course taught by two instructors (identical content, 
assessments) from fall 2014 to spring 2016 informed prediction 
modeling (Table 1). 


Table 1. Training and testing data 

             | Training                  | Testing 
Instructor A | Fa 2014 & Sp 2015 (n=167) | Fa 2015 (n=96) 
Instructor B | Fa 2014 & 2015 (n=161)    | Sp 2016 (n=86) 
Both         | (n=328)                   | (n=182) 

Development of the prediction model involved two main phases, 
training and testing. In the training phase, logistic 
regression with forward selection was used to build the prediction 
model, and the problem of overfitting was examined through 
10-fold cross-validation. In the testing phase, the most accurate 
prediction model developed in the training phase was applied to the 
testing data set to assess potential overfitting and ensure 
generalizability to future students’ data [2]. 


Based on the Kappa (κ) and recall, the best 3-week prediction 
model developed through the training and testing phases was then 
applied to data from fall 2016 Calculus students to identify students 
in need of an early alert message that provides learning support. 
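
For reference, both selection criteria can be computed from a 2x2 confusion matrix. The sketch below is illustrative only. The counts are those reported for the combined training set in Table 2, and treating 97 and 96 as the correctly classified cells is an assumption about how the cells are laid out; the function names are hypothetical.

```python
def accuracy(matrix):
    """Share of cases on the diagonal of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = matrix
    return (a + d) / (a + b + c + d)

def cohen_kappa(matrix):
    """Cohen's kappa for a 2x2 matrix with rows = true class, columns = predicted."""
    (a, b), (c, d) = matrix
    total = a + b + c + d
    observed = (a + d) / total
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / total ** 2
    return (observed - expected) / (1 - expected)

def recall(true_positives, false_negatives):
    """Share of truly at-risk students that the model flags."""
    return true_positives / (true_positives + false_negatives)

# Assumed layout: diagonal cells 97 and 96 are correct classifications.
combined_training = [[97, 69], [63, 96]]
kappa = cohen_kappa(combined_training)
share_correct = accuracy(combined_training)
```

Kappa corrects the raw accuracy for the agreement expected by chance, which is why a model can be right in roughly 6 of 10 cases and still have a kappa near .2.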


In order to investigate the effect of messaging identified students, 
those identified as likely to perform poorly by the prediction model 
were randomly divided into two groups, a “Message” group who 
would receive a message that focused attention on an upcoming 
exam and some useful learning resources (Figure 1) and a “No 
Message” group who would not. 


Hi [Name]! 

Our first course exam is coming up on Friday... 

1. The first is a one-page summary of advice from students who have 
completed the course with an excellent grade in the past... 


2. A set of learning modules called "The Science of Learning to Learn." These 
modules describe learning strategies you can use with our course 
materials... 


Figure 1. Message to students 
3. RESULTS 

Among three models, the prediction model based on Instructor B’s 
students produced the best Kappa (κ = 0.26) and recall (73%) 
values. The model accurately identified more than 7 in 10 students 
who would ultimately earn less than 80% of points (i.e., a C or worse). 
We thus moved forward to the testing phase (Table 2), and to the 
prediction and messaging phase, using the Instructor B model. 


Table 2. Prediction models in the training and testing phase 

Phase    | Model                         | Counts (True: Predicted) | κ   | Accuracy (%) | Precision (%) | Recall (%) 
Training | Instructor A                  |                          |     |              |               | 
Training | Instructor B (Fa 2014 & 2015) | 9 63                     | .26 | 63           | 52            | 73 
Training | Both                          | 97, 69, 63, 96           | .19 | 59           | 59            | 60 
Testing  | Instructor A (Fa 2015)        | 16, 25, 9, 46            | .24 | 65           | 65            | 84 
Testing  | Both                          | 35, 48, 20, 79           | .21 | 63           | 62            | 80 


In the testing phase, attributes and their weights obtained in the 
training phase were applied to the testing data to examine the risk of 
overfitting. The prediction model yielded a Kappa value of .20 
or more for all testing sets. In addition, recall values were 84, 75, 
and 80, respectively, all of which were greater than the result in the 
training phase. We thus retained the Instructor B model for the 
prediction and messaging phase. 


Upon sending the message four days prior to the first exam, student 
access of recommended resources and performance on exams were 
tracked throughout the remainder of the semester. For all exams 
throughout the semester, the students in the treatment group (i.e., 
Message & Access) performed better than those without any 
treatment (No Message, No Access; p <.05). In addition, effect 
sizes for all exams were more than “medium” (d > .5) (Table 3). 


Table 4. Contingency Table 

Predicted C or Worse 

True        | Message & Access | Control 
B or Better | 11 (58%)         | 7 (25%) 
C or Worse  | 8 (42%)          | 21 (75%) 
Total       | 19               | 28 


Table 4 shows the proportion of students who performed better than 
projected (i.e., B or Better) vs. as projected (i.e., C or Worse). A Chi-square 
analysis indicated that a significantly greater proportion of students 
(58%) in the Message and Access group earned a final grade of B 
or better, χ²(1, N = 47) = 5.18, p = .02. Only 25% of students predicted to 
earn a C or worse outperformed their prediction in the No Message, 
No Access control group. 
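
The reported statistic can be reproduced from the counts in Table 4 with the standard Pearson chi-square formula for a 2x2 table. The sketch below assumes no continuity correction, which is consistent with the reported value; the function names are hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Rows: B or Better, C or Worse. Columns: Message & Access, Control.
chi2 = chi_square_2x2(11, 7, 8, 21)
p = p_value_df1(chi2)
```

With these counts the statistic is about 5.18 and the p value is about .02, matching the values stated above.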


4. DISCUSSION 


In this study, those who received a brief email message from a 
course instructor and accessed a learning resource outperformed 
non-messaged students on all exams. Results thus indicate that 
data-driven interventions can be provided relatively early in the 
semester — six weeks earlier than the typical data-driven indicator 
of poor future outcome: a week 9 response to midterm grades. The 
>200-word message required only a minute or two of a typical 
student’s time, and a visit to the advice page — the common material 
accessed — required only slightly more time investment from 
messaged students (~900 words). 


The benefits of receiving a message and accessing the resources it 
recommends were substantial: 12% on all exams, or a full letter 
grade. Surprisingly, few students heeded the early alert as intended; 
30% of messaged students accessed supportive materials, 
confirming that obtaining students’ attention is a clear challenge to 
realization of the benefits messaging can provide. Messaging 
efforts thus clearly require improvement. We must also consider 
how to provide more adaptive message contents based on students’ 
likelihoods of poor performance, or different supports based on the 
maladaptive practices summarized by features present in students’ 
prediction models. More specific feedback about the kinds of 
learning behaviors that require adjustment may further increase 
messages’ effects. 
ABSTRACT 


Clustering analysis in the context of education is important 
for determining the effectiveness of group activities especially 
when participants freely rotate between groups such as in a 
gallery exhibit or other informal learning space or set-up. In 
this paper, we cover a method of applying Gaussian Mixture 
Models to two-dimensional data. We further describe the 
analysis procedure, and the success of implementing this anal- 
ysis using simulated data and real data. Finally, we discuss 
some educational applications as well as future directions for 
this research. 


Keywords 

Gaussian Mixture Models, MCMC, Gibbs Sampling, Real- 
Time Location System, Informal Learning Spaces, Learning 
Analytics, Dynamic Mixture Model 


1. INTRODUCTION 

Real-time locating systems have become increasingly popular 
and are predicted to be more widely adopted in informal learn- 
ing institutions such as libraries, museums, and after school 
spaces in the next few years [2] [4]. Location intelligence and 
contextually relevant information can inform dynamically 
customized information and meaningful learning analytics for 
both learners and educators based on visitors and/or learners’ 
location [3]. Such data are especially useful to understand 
social interactions in informal learning events. Therefore, it 
is essential for researchers to develop data mining methods 
to more efficiently and effectively explore real-time location 
data of learners. 


Gaussian Mixture Models (GMM) are very useful for analyz- 
ing two-dimensional data which may be clustered into groups 
such as that collected by a real-time locating system in an 
informal learning space. To estimate the parameters of the 
GMM we employ a Markov Chain Monte Carlo method of 
Gibbs sampling [1] whose stationary state is the posterior 
distribution of the mixture model. This method applied to 
a frozen snapshot of the two-dimensional real-time location 
tracking data allows us to gain information about the groups, 
such as group membership, group location, and internal 
group dispersion, based only on the tag position data. Other 
algorithms such as k-means clustering may similarly cluster 
two-dimensional data but are non-parametric whereas Gibbs 
sampling is parametric. 


2. DATA ANALYSIS 

2.1 Simulations 

To test the Gibbs sampling process and our R code, we
drew a set of location data points from bivariate
normal distributions centered on three different centers
($\mu_1 = (15, 15)$, $\mu_2 = (15, 0)$, $\mu_3 = (0, 15)$) with a common
covariance. We observed the latent parameters of our Gibbs
sampler reaching a stationary state in fewer than 100 iterations.
In Figure 1a we generate estimated points using the estimated
group centers and covariance and perform kernel density
estimation to generate the coverage contours plotted over the
original generated data. The percentage of estimated points
outside the contours is marked on the contour lines. In this
case we see that, for 120 data points, only a small number of the
data lie outside the 99.5% coverage contours. We
can also verify the results by comparing the generating values
for the centers and covariance with the estimated values.
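The simulation check described above can be sketched in a few lines. This is a simplified illustration in Python, not the authors' R code: it assumes a known, shared, spherical covariance (sigma^2 times the identity) instead of estimating the covariance, a flat Dirichlet prior on the mixing weights, a wide Gaussian prior on the centers, and a farthest-first initialisation. The function names and all of these choices are assumptions made for the example.

```python
import math
import random


def simulate(centers, n_per_group, sigma, seed=1):
    """Draw points from spherical bivariate normals around each center."""
    rng = random.Random(seed)
    return [(rng.gauss(cx, sigma), rng.gauss(cy, sigma))
            for cx, cy in centers for _ in range(n_per_group)]


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def gibbs_gmm(points, k, sigma, prior_sd=100.0, iters=100, seed=0):
    """Gibbs sampler for a 2-D mixture with known covariance sigma^2 * I.

    Returns the group centers averaged over the second half of the chain.
    """
    rng = random.Random(seed)
    # Farthest-first initialisation of the centers (illustrative choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    weights = [1.0 / k] * k
    kept = []
    for it in range(iters):
        # Step 1: sample a group label for every point.
        labels = []
        for p in points:
            logw = [math.log(weights[j]) - sq_dist(p, centers[j]) / (2 * sigma ** 2)
                    for j in range(k)]
            top = max(logw)  # subtract the maximum for numerical stability
            probs = [math.exp(v - top) for v in logw]
            labels.append(rng.choices(range(k), weights=probs)[0])
        # Step 2: sample mixing weights from Dirichlet(1 + group counts).
        draws = [rng.gammavariate(1 + labels.count(j), 1.0) for j in range(k)]
        weights = [d / sum(draws) for d in draws]
        # Step 3: sample each center from its Gaussian full conditional
        # (prior: independent N(0, prior_sd^2) on each coordinate).
        centers = []
        for j in range(k):
            members = [p for p, z in zip(points, labels) if z == j]
            precision = 1 / prior_sd ** 2 + len(members) / sigma ** 2
            centers.append(tuple(
                rng.gauss(sum(p[d] for p in members) / sigma ** 2 / precision,
                          math.sqrt(1 / precision))
                for d in range(2)))
        if it >= iters // 2:
            kept.append(centers)
    return [tuple(sum(c[j][d] for c in kept) / len(kept) for d in range(2))
            for j in range(k)]
```

Calling `gibbs_gmm(simulate([(15, 15), (15, 0), (0, 15)], 40, 1.0), 3, sigma=1.0)` mirrors the 120-point simulation; the returned centers can then be compared with the generating values, as the text suggests.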


2.2 Applications on Real Data 

The real data were collected at an EdLab meeting in an
innovative learning space: the Smith Learning Theater at
Teachers College, Columbia University. The Smith Learning
Theater features technologies such as the Quuppa™ real-time
locating system, installed to return measurable results
and provide feedback to organizers and facilitators. In this
meeting, 15 EdLab members wore Quuppa real-time locating
tags and freely explored four stations of augmented/virtual
reality apps in order to provide reviews for a national edtech
competition. Applying the Gibbs sampling method to the
real data, we again observed convergence within just 100
iterations. The coverage contours are again drawn onto the
plot of the positions in Figure 1b. In this analysis we did
not have prior knowledge of the station device locations,
which are likely to be correlated with the group centers. However, we
can still verify the success of the algorithm by noting that
the data points lie largely within the ninety-five percent
coverage region. As such, our method returns accurate group
information even with a small dataset.


3. DISCUSSION

Figure 2: Linear Correlation Between Computational Time
and the Number of Data Points


3.1 Educational Research and Applications 

Our method has the limitation that the expected number 
of groups must be specified prior to performing the Gibbs 
sampling. This quantity can be available for events where 
group work takes place, or participants move around through 
different stations. In such an event our analysis can be 
implemented repeatedly over a series of consecutive discrete 
snapshots covering a period of time. By observing the group 
membership at each snapshot, the educator can determine 
information about who moved together as a group, or who 
moved mostly independently. Common group membership 
can be denoted in an adjacency matrix for the tags, where the
value at each index $(i, j)$ is the number of snapshots in which
two locating tags $y_i$, $y_j$ shared the same group assignment.
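The co-membership matrix described above can be built directly from the per-snapshot group labels. The following is a minimal sketch in Python; the function name and the input layout (one list of labels per snapshot, indexed by tag) are assumptions made for illustration.

```python
def comembership_counts(snapshots, n_tags):
    """Count, for every pair of tags, the snapshots in which they share a group.

    snapshots: list of label lists; snapshots[t][i] is the group assigned
               to tag i in snapshot t.
    """
    counts = [[0] * n_tags for _ in range(n_tags)]
    for labels in snapshots:
        for i in range(n_tags):
            for j in range(i + 1, n_tags):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
                    counts[j][i] += 1  # keep the matrix symmetric
    return counts
```

Because labels are only compared within a snapshot, the counts do not depend on how the sampler happens to number the groups from one snapshot to the next.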


This approach has the potential to provide information about 
whether the learning space or activity was better suited for 
group learning or independent learning and the preferences 
of each participant to remain with the same group of people 
or move about with different people. In other events where 
group work may be taking place one can easily determine 
the amount of cross-group collaboration during a period of 
time by again looking at the cumulative group assignment 
data. 


3.1.1 Feasibility Analysis 

The implementation of the Gibbs sampling algorithm takes
linear O(N) time, where N is the number of position data
points in a single snapshot. To check this, we generated N position
data points, recorded the time elapsed for M iterations,
and visualized the linear relationship in Figure 2. Given
an hour-long event with 500 participants, covered by 360
snapshots, the linear model suggests that one could perform
250 iterations of the sampler over every snapshot in under
twenty minutes. As such, implementation of our method is
feasible for most educational contexts.
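To make the feasibility claim concrete, the arithmetic behind it can be written out. The sketch below only restates the figures given in the text (500 participants, 360 snapshots, 250 iterations, a twenty-minute budget); the function name is hypothetical, and the throughput it reports is what the claim implies, not a measured value.

```python
def required_rate(n_points, n_snapshots, n_iters, budget_seconds):
    """Total per-point sampler updates and the throughput needed to fit the budget."""
    total_updates = n_points * n_snapshots * n_iters
    return total_updates, total_updates / budget_seconds


total, rate = required_rate(500, 360, 250, 20 * 60)
print(total, rate)  # 45000000 updates, 37500.0 updates per second
```

Under linear scaling in N, the twenty-minute figure therefore corresponds to sustaining roughly 37,500 per-point updates per second.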


3.2 Future Work 


While our model is useful to see the group information within 
a snapshot of real-time location data, we believe that more 
important data will arise from extending our current mix- 
ture model to a Dynamic Mixture Model (DMM) [5]. In 
such a DMM, the group distribution of each snapshot would 
be dependent on the previous one. According to Wei et al. 
(2007) the assumption that two consecutive snapshots are 
dependent can allow us to analyze important patterns that 
would otherwise be missed in discrete snapshot analysis. By 
incorporating the temporal component, we expect to more 
accurately model transitions between groups. The applica- 
tion of our method is especially valuable in informal learning 
spaces as many learning events in these spaces encourage free 
exploration and group interactions, and evaluating learners’ 
engagement and social group dynamics is challenging using 
other traditional research methods.


ABSTRACT


Pressible is a school blogging and content management system 
developed by EdLab at Teachers College Columbia University. In 
this paper, social network analysis and natural language 
processing with Latent Dirichlet Allocation topic model 
approaches were utilized to gain insights into Pressible, to explore 
four developmental stages of a college-wide social network and 
their associations with blog content. The results showed that 
professors who developed courses became the most influential 
persons in the network. Students extended the online discussion 
topics beyond the scope of course topic set by professors. 


Keywords 
SNA, NLP, Topic Model, LDA 


1. INTRODUCTION 


EdLab adapted the WordPress content management system
(CMS) framework and developed Pressible for the Teachers
College (TC) community in 2008. It was designed to deliver content
quickly, minimize the time users spend managing technology,
and develop connections between users (Zhou, 2013). From
the perspective of social constructivist theory, people 
communicate, contribute and acquire knowledge through social 
engagement and discussion of topics (Vygotsky, 1978). People 
also gain knowledge online via connecting information (Siemens, 
2004). Massive Open Online Courses (MOOCs) provide more 
opportunities for people to study for personal intellectual growth 
(Kizilcec et al., 2017). Social factors from online discussion 
forums (Rose, et al., 2014) and engaging in higher order thinking 
behaviors enhanced learning in MOOCs (Wang, et al., 2016). 
Higher Education utilizes academic blogging to facilitate social 
networking, self-directed learning, and collaboration. Simulation 
studies on the blogosphere indicate that improved management 
facilities on course blogs positively affect the density and 
connectedness in learning networks (Wild & Sigurdarson, 2011). 
This study utilized social network analysis (SNA) to investigate 
human-human interaction and the development of social 
connections on this blogging platform. Next, Latent Dirichlet 
Allocation (LDA) topic model method was applied to understand 
human-information interaction during different developmental 
stages of Pressible. This study provides an exploratory 
examination of four developmental stages of an online learning 
community in a school blogging system. 


2. METHODOLOGY 
2.1. Participants and Data Collection 


The data were collected from the entire Pressible database and 
contained 3598 users and 594 sites, with 50422 posts in total. The 
specific aim of this study was to explore the social network and its 
association with content creation. Only the interactions between 
registered IDs were counted as valid connections. After the 
reconstruction of the database for SNA, there were 172 blogs with 
data on a total of 11146 connections and 429 interactive users.

2.2. Social Network Analysis


SNA is a method to analyze the connections, relationships, and
interactions between individuals and communities in a
collaborative social network, expressed as node and edge
diagrams (Wild, 2016; Slater et al., 2017). In this study, the R
package igraph (Csardi & Nepusz, 2006) was used to construct, modify, and
analyze the social networks. Density measures the proportion of
contacts observed between pairs of nodes in the network. Eigenvector
centrality measures the importance of a node in the network by
weighting its connections according to the centrality of the nodes
it connects to (Daniel et al., 2010).
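The two network measures can be illustrated on a toy graph. The sketch below is plain Python rather than igraph, treats the network as undirected and unweighted (an assumption; the study's network may be directed), and computes eigenvector centrality by power iteration. Function names are hypothetical.

```python
def density(n_nodes, edges):
    """Share of possible undirected node pairs that are connected."""
    distinct = {frozenset(e) for e in edges if e[0] != e[1]}
    return len(distinct) / (n_nodes * (n_nodes - 1) / 2)


def eigenvector_centrality(n_nodes, edges, iters=200):
    """Power iteration; scores are scaled so the most central node has 1.0."""
    neighbours = [[] for _ in range(n_nodes)]
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    scores = [1.0] * n_nodes
    for _ in range(iters):
        # Iterating with A + I (the "scores[i] +" term) keeps the same
        # eigenvectors but avoids oscillation on bipartite graphs.
        new = [scores[i] + sum(scores[j] for j in neighbours[i])
               for i in range(n_nodes)]
        top = max(new)
        scores = [v / top for v in new]
    return scores
```

For a star-shaped course blog, with one professor (node 0) connected to three students, `density(4, [(0, 1), (0, 2), (0, 3)])` is 0.5 and the professor receives the highest centrality score, which matches the pattern reported for course creators in this network.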

2.3. Latent Dirichlet Allocation Topic Modeling 

To analyze the content of comments and posts in the blogs, LDA
topic modeling was utilized to discover and infer the general
topics by scanning the words and their distribution probabilities
within documents (Blei et al., 2003). The R package tm was used
to construct the corpus for text mining. The tm package removes
extra white space, stop words, numbers, and punctuation, converts
the words to lower case, and stems them to construct a term-document
matrix, which allows analysis of individual words in the corpus
(Feinerer & Hornik, 2015; Lang, 2017). The R packages
topicmodels and tidytext were utilized to calculate the term
frequency and inverse document frequency, remove
uncommon terms, find the most common words for individual
topics, and group the documents by generated topics (Grün &
Hornik, 2011; Lang, 2017; Silge & Robinson, 2017).
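The preprocessing and term weighting steps can be sketched as follows. This is a Python illustration of the general idea, not the tm or tidytext implementation: the stop word list is a tiny made-up example, stemming is omitted, and the weighting assumes idf = ln(N / df).

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}  # tiny illustrative list


def tokenize(text):
    """Lower-case, keep alphabetic tokens only, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: tf-idf} dict per document, with idf = ln(N / df)."""
    counts = [Counter(tokenize(d)) for d in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    return [{t: (n / sum(c.values())) * math.log(n_docs / df[t])
             for t, n in c.items()}
            for c in counts]
```

A term that occurs in every document receives weight zero, which is why very common words contribute little when characterizing individual posts.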


3. RESULTS AND DISCUSSIONS 


3.1. Social Network Development 

Descriptive statistics on yearly data were computed to
show the general social network activity in Pressible by
developmental stage (Table 1). The results indicate that this
blogging system shifted from a development stage (beginning to
Summer 2010), to a stable growth stage (Fall 2010 to Summer
2012), a rapid growth stage (Fall 2012 to Summer 2015), and into
a decline stage (Fall 2015 until now). The number of active members
increased from the development stage to the rapid growth stage and
decreased in the decline stage. Engagement, measured as the average
number of connections, likewise increased from the development stage
to the rapid growth stage and dropped in the decline stage. Therefore,
the number of active members and their engagement rate
determine the growth of this online social learning community.
The density of the social network among active members 
decreased while the network was growing from 2011 to 2015 
(Fig. 1), indicating that the network became decentralized as more 
active members joined. Most of the participants were students. 
They became less active in interactions on Pressible after 
graduation. New students joined the social network and formed 
new social centers. Thereby, the global social density decreased 
because of the dynamic student community (Fig. 1). As more
professors built their courses on Pressible, more active students
joined this online learning community for discussions and made
meaningful connections. Recruiting more professors to take
advantage of Pressible's online course creation features is
key to maintaining the rapid growth of this social network.
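Table 1. Descriptive Statistics by Developmental Stage

| Stage | Ave. conn. (/year) | Ave. active IDs (/year) | Ave. conn. per ID (/year) | Top popular topic of the stage |
| Development | [illegible] | [illegible] | [illegible] | video game |
| [growth stage, label illegible] | 2021 | 118.7 | [illegible] | think and know |
| Decline | 993 | 104 | 96 [sic; possibly 9.6] | performance |

[Remaining rows and cells are not recoverable from the source.]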

3.2. Most Influential Members and Topic Interaction Analysis by Developmental Stage

To determine the optimal number of topics for the whole Pressible
database, the perplexity values of candidate models were calculated. The
LDA topic models were trained on 10000
documents with the number of topics ranging from 2 to 50. The other
1146 documents were used to test each model by calculating
perplexity and entropy. Based on the perplexity of the test data, 30
is the optimal topic number for this dataset.
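Held-out perplexity, the selection criterion used above, can be computed as in the following sketch. It is a Python illustration, not the topicmodels implementation: it assumes the per-document topic weights for the test documents and the per-topic word probabilities are already available and that every test word has non-zero probability. All names are hypothetical.

```python
import math


def perplexity(docs, doc_topic, topic_word):
    """exp(-(sum of log p(w | d)) / (number of held-out tokens)).

    docs:       list of {word: count} dicts (held-out documents)
    doc_topic:  doc_topic[d][k] = weight of topic k in document d
    topic_word: topic_word[k][word] = probability of word under topic k
    """
    log_lik, n_tokens = 0.0, 0
    for d, counts in enumerate(docs):
        for word, n in counts.items():
            # Mixture probability of the word in this document.
            p = sum(theta * topic_word[k].get(word, 0.0)
                    for k, theta in enumerate(doc_topic[d]))
            log_lik += n * math.log(p)
            n_tokens += n
    return math.exp(-log_lik / n_tokens)


def best_topic_number(perplexity_by_k):
    """Pick the candidate topic number with the lowest held-out perplexity."""
    return min(perplexity_by_k, key=perplexity_by_k.get)
```

A model that spreads probability uniformly over a vocabulary of V words has perplexity V, so lower values indicate that the fitted topics predict the test documents better than an uninformed guess.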
During the development stage, library staff were the most
active members in the network. Their online discussions focused
on the topics “video game, education”, indicating that library
staff were using Pressible as a communication tool to share
thoughts and discuss educational media.
During the stable growth stage, a TC professor (ID: 1490) from
the music education program built his courses on Pressible for
three years (2011 to 2013), and he received the
highest eigen centrality score in each of those years. During the stable
growth stage, the popular topics became focused on education.
People who talked about “think and know” were also interested in
“video game” at this stage.
During the rapid growth stage, the professor with ID 3132 brought
new students into this blogging system through his course
Creativity & Problem Solving in Music Education. It was a
course extended from the materials developed by the
professor with ID 1490, with the same topic “read” and high-frequency
words “music, read” for most of the posts. This was
the pedagogy course to meet the New York State and national
teacher preparation standards. Individuals’ topic co-occurrence
indicated a robust network in the rapid growth stage (Fig. 2). People
talked about the topics of “creativity”, “music composition”,
“Jazz”, “social education”, “learn and think”, “experience and life”
and “teach and learn” at high co-occurrence frequencies (above 30).
In the decline stage, the topic co-occurrence network dropped in
topic connection intensity, which might be due to fewer active
members in the overall network (Table 1). This finding indicated
that more active members encouraged online discussions with
more diverse topics. In course blogs, students extended discussion
topics to the perspectives that they care about: “music learning,
music playing, social education, creativity and experience and
life”, beyond the scope of the professor’s set topic “read”.


Figure 1. Social Network Density by Year.

Figure 2. Topic co-occurrence frequency in the rapid growth stage.


4. IMPLICATIONS 


This study identifies and explores four developmental stages of 
the social network: development, stable growth, rapid growth, and 
decline. The SNA and topic model analysis results imply that the 
influential people will bring new communities into the social 
network by sharing the content of the hottest topics. Deliberately 
recruiting more influential people into the social network would 
accelerate its transition from the stable growth stage to the rapid 
growth stage. 


5. REFERENCES 

[1] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent 
Dirichlet allocation. Journal of Machine Learning Research, 
3(Jan), 993-1022 

[2] Csardi, G., & Nepusz, T. (2006) The igraph software 
package for complex network research, InterJournal, 
Complex Systems 1695. 2006. 

[3] Daniel, M., Messing, S., Nowak, M., & Westwood, S. J. 
(2010) Social Network Analysis Labs in R. Stanford 
University 

[4] Feinerer, I., & Hornik, K. (2015). tm: Text Mining Package. 
R package version 0.6-2. 

[5] Grün, B., & Hornik, K. (2011). “topicmodels: An R Package
for Fitting Topic Models.” Journal of Statistical Software,
40(13), pp. 1-30

[6] Kizilcec, R. F., Pérez-Sanagustín, M., & Maldonado, J. J.
(2017). Self-regulated learning strategies predict learner
behavior and goal attainment in Massive Open Online
Courses. Computers & Education, 104, 18-33.

[7] Lang, C. (2017) HUDK 4051: Learning Analytics: Process 
and Theory. Columbia University. New York 

[8] Rosé, C. P., Carlson, R., Yang, D., Wen, M., Resnick, L.,
Goldman, P., & Sherer, J. (2014). Social factors that
contribute to attrition in MOOCs. In Proceedings of the first
ACM conference on Learning @ scale (pp. 197-198). ACM

[9] Siemens, G. (2004). Connectivism: A learning theory for the
digital age. elearnspace. Retrieved December 12, 2007.

[10] Silge, J., and Robinson, D. (2017) “Text Mining with R: A 
Tidy Approach” O'Reilly Media 

[11] Slater, S., Joksimovic, S., Kovanovic, V., Baker, R., & 
Gasevic, D. (2017) Tools for Educational Data Mining: A 
Review. Journal of Educational and Behavioral Statistics. 
2017, Vol. 42, No. 1 p85-106 

[12] Vygotsky, L. (1978). Mind and society: The development of 
higher psychological processes. Cambridge, MA: Harvard 
University Press. 

[13] Wang, X., Wen, M., & Rosé, C. P. (2016, April). Towards 
triggering higher-order thinking behaviors in MOOCs. In 
Proceedings of the Sixth International Conference on 
Learning Analytics & Knowledge (pp. 398-407). ACM 

[14] Wild, F., & Sigurdarson, S. E. (2011). Simulating learning 
networks in a higher education blogosphere—at scale. In 
European Conference on Technology Enhanced Learning 
(pp. 412-423). Springer Berlin Heidelberg 

[15] Wild, F. (2016). Learning analytics in R with SNA, LSA, 
and MPIA. Springer. 

[16] Zhou Z. (2013) Connecting Teacher Bloggers: Unleashing 
the Educational Power of Wordpress 


Proceedings of the 10th International Conference on Educational Data Mining 363 



Untangling The Program Name Versus The Curriculum: An 
Investigation of Titles and Curriculum Content 


R. Wes Crues 
University of Illinois 
Dept. of Educational Psychology 
1310 South Sixth Street 
Champaign, Illinois 
crues2@illinois.edu 


ABSTRACT 


This investigation focuses on the relationship between skills
taught during business programs and whether the skills taught
relate to the title of the program, as deemed by subject-matter
experts. We home in on formal degree and non-degree
programs in small business education, entrepreneurship education,
or a blend of these two to determine if the name of
the program is related to the skills taught in said program.
We use a collection of excerpts from college catalogs, which
are all descriptions of the formal academic programs. We
then use k-means clustering to group program descriptions
into interpretable clusters. We discuss the findings from the
cluster analysis.


Keywords 


text mining, clustering, higher education, business education 


1. INTRODUCTION 


Major academic disciplines are typically collections of finer- 
grained specialties; for example, a computer science depart- 
ment might consist of experts in human-computer interac- 
tion, artificial intelligence, algorithm design, among others. 
Colleges likely have departments with similar names, but we 
want to understand if similarly named degree programs at 
different universities equip students with similar skills. To 
discern whether or not this task is tractable, we used a col- 
lection of program descriptions from college catalogs about 
programs claiming to teach students entrepreneurship, small 
business, or a blend between these two curriculum areas. 
These definitions are used throughout: 


• A program description is at least one, but often comprises
a few, paragraphs that delineate skills taught
in programs, and might provide some learning goals
and a listing of courses;

• Entrepreneurship is defined as “trying to identify opportunities
and putting useful ideas into practice” [1] (p. 6);

• and, small business management is “the ongoing process
of owning and operating an established business” [3] (p. 28).


Table 1: Distribution of Program Descriptions

Program Label              Degree/Non-Degree
Entrepreneurship           247/197
Small Business & E-ship    42/40
Small Business             20/59
[label missing]            [illegible]/58


Our study explores whether we can use text clustering to 
identify a clear distinction between these two areas of busi- 
ness education, determine if there are differences between 
two-year and four-year programs, and whether there are dif- 
ferences between degree and non-degree programs. 


2. METHOD 


A research team manually assembled a collection of 697 program
descriptions from college catalogs for institutions located
in the United States. Research assistants went to college
websites and manually extracted text from the
college catalogs published online. The initial list of programs was
derived from the 2013 Integrated Postsecondary Education
Data System (IPEDS) maintained by the United States Department
of Education. After filtering out institutions that
did not have any business programs, we drew a random sample of
programs to arrive at the collection used.


Program descriptions spanned programs focusing on entrepreneurship,
small business management, or a blend of the two. Additional
program descriptions were collected for what we considered
special focus programs; these were programs which
teach a specific skill set for operating a business (examples
range from funeral home management to hair weaving and
braiding entrepreneur). We also considered whether the program is a
formal degree (e.g., associate and bachelor's degrees) or non-degree
program (e.g., certificates or specializations), whether the
home institution is public or private, for-profit or not-for-profit,
and whether the institution is a 2-year, 4-year, or
4-year and beyond institution [5]. Table 1 presents the distribution
of program labels and whether the program is a
degree or non-degree program.


2.1 Preprocessing Program Descriptions 
Program descriptions were transformed into raw text format
and tokenized into unigrams, with a few exceptions. A few
bigrams and trigrams were specified using knowledge from
a domain expert, for example, business plan(s), social entrepreneurship,
home based business, and venture capital.
Punctuation, numbers, and stop words were removed using
the pre-defined English stop word list in the “tm” package
in R [2]. We stemmed words using the Porter stemming
algorithm [6]. We used binary indicators to record
whether a term was present in each program description
when constructing the document-term matrix [4].
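The preprocessing steps above can be sketched as follows. This Python illustration is not the tm pipeline used in the study: the stop word list and the phrase list are small made-up examples, and Porter stemming is left out for brevity.

```python
import re

STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}  # tiny illustrative list
PHRASES = ["business plan", "venture capital"]  # stand-ins for expert n-grams


def tokenize(text):
    """Lower-case, protect multi-word terms, drop digits, punctuation, stop words."""
    text = text.lower()
    for phrase in PHRASES:
        # Join the words of a protected phrase so it survives as one token.
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return [t for t in re.findall(r"[a-z_]+", text) if t not in STOP_WORDS]


def binary_dtm(docs):
    """Return (vocabulary, matrix) with 1 where a term occurs in a document."""
    token_sets = [set(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*token_sets))
    matrix = [[1 if term in s else 0 for term in vocab] for s in token_sets]
    return vocab, matrix
```

With binary indicators, a term that is repeated many times in a long description counts the same as a term mentioned once, which reduces the influence of description length on the clustering.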


2.2 Corpus Statistics 

Our initial document-term matrix contained 7799 unique
terms with a sparsity of 99%. We removed very frequent
terms deemed to have no substantive value by a domain
expert. Due to the nature of the corpus (i.e., program
descriptions), words such as catalog, college, semester, requirements,
and introduction, among others, were excluded.
Finally, we used the “removeSparseTerms” function in
the “tm” package in R [2], which resulted in a document-term
matrix with 16 unique terms that was nevertheless still 70% sparse.
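The idea behind removing sparse terms can be illustrated in a few lines. The sketch below is a Python analogue written for this explanation and does not reproduce the tm function itself; the threshold semantics (keep a term only if it is absent from less than `max_sparsity` of the documents) and all names are assumptions.

```python
def term_sparsity(matrix):
    """Fraction of documents in which each term (column) is absent."""
    n_docs = len(matrix)
    return [sum(1 for row in matrix if row[j] == 0) / n_docs
            for j in range(len(matrix[0]))]


def remove_sparse_terms(vocab, matrix, max_sparsity):
    """Keep only the terms whose sparsity is below max_sparsity."""
    keep = [j for j, s in enumerate(term_sparsity(matrix)) if s < max_sparsity]
    return [vocab[j] for j in keep], [[row[j] for j in keep] for row in matrix]


def overall_sparsity(matrix):
    """Share of zero cells in the whole document-term matrix."""
    cells = [v for row in matrix for v in row]
    return cells.count(0) / len(cells)
```

Applying a strict threshold shrinks the vocabulary sharply, which is consistent with the drop from thousands of terms to a small set reported above, while the remaining matrix can still contain many zeros.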


2.3 Program Description Clustering

We utilized k-means because this clustering technique was
favored in prior studies [7]. We experimented with various
numbers of centroids, and after discussions with domain
experts, we determined k = 10 was an optimal solution.
The domain expert believed this solution provided an interpretable
and reasonable grouping of programs. Specifically,
the solution agreed with their expectations about the distribution of
entrepreneurship, small business, blended, and special
focus programs, coupled with their expectations of the distribution
of formal degree programs versus non-degree programs.
More than ten centroids resulted in clusters containing fewer
than five documents, while fewer than ten resulted in a solution
that domain experts did not believe to
be the most interpretable.
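For readers unfamiliar with the procedure, k-means on binary document-term rows can be sketched as below. This is a plain Python version of Lloyd's algorithm with a deterministic farthest-first initialisation; the initialisation and the function names are assumptions for illustration, and the study's own software settings are not stated in the text.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iters=50):
    """Lloyd's algorithm; returns (labels, centroids)."""
    # Farthest-first initialisation (illustrative, deterministic).
    centroids = [[float(v) for v in rows[0]]]
    while len(centroids) < k:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append([float(v) for v in far])
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: nearest centroid for every row.
        labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j]))
                  for r in rows]
        # Update step: each centroid becomes the mean of its members.
        new = []
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                new.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new.append(centroids[j])  # leave an empty cluster in place
        if new == centroids:
            break
        centroids = new
    return labels, centroids
```

Counting the members of each cluster after a run, for example with `labels.count(j)`, gives the cluster sizes that the authors inspected when they rejected solutions with very small clusters.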


3. RESULTS 


Five of the clusters exhibited a focus on teaching entrepreneurship
in the context of having an idea and creating a start-up,
with the intention of scaling the business into a large enterprise.
Within these clusters, two clusters had words indicating
programs might teach entrepreneurship to equip students
to solve global problems and health concerns. Words
indicating entrepreneurship might be taught to professionals
in fields besides business (i.e., law and engineering) appeared
in one cluster. One cluster appeared to teach general business
skills, without a clear focus on entrepreneurship or small
business. Another cluster contained special focus programs,
which seek to prepare students for a specialized, technical
career, such as a travel agent or carpenter. Two clusters contained
small business programs, where one focused keenly
on running one's own business, while the other included this
while teaching students to innovate. One cluster contained
very detailed program descriptions from one institution.


4. DISCUSSION & CONCLUSIONS 


We found that the definition of entrepreneurship which pertains
to creating and expanding new enterprises appeared
almost exclusively in four-year colleges, especially research
universities. In contrast, small business management and
operating a small business were taught almost exclusively
at two-year colleges. A few of the two-year colleges also had
many specialized programs in applied fields, such as cosmetology;
these types of programs were nearly exclusive to
two-year colleges. Another element of entrepreneurship is
creativity and innovation. These skills, specifically innovation,
seemed to be taught primarily in the four-year sector.
The programs that considered themselves a blend tended to
focus more on small businesses than entrepreneurship. We
found innovation and related skills to be taught more in degree
programs. On the other hand, skills related to managing a small
business were taught in non-degree programs.


From our findings about entrepreneurship and small busi- 
ness education, we generally found labels of programs match 
the skills one would expect to learn given the name of the 
program. However, one cluster in our analyses did not in- 
dicate skills in the targeted areas were being specifically 
taught. A limitation of our study is that program descriptions
vary in length and detail, which might be problematic for
clustering. In future work, we plan to consider whether the skills
taught have changed over time; for example, are the skills being
taught today the same skills taught a decade ago?


ABSTRACT


Text mining has been used in various fields including education. 
Using unsupervised sentiment analysis combined with a clustering 
algorithm, we discovered 2 emerging clusters of learning 
characteristics (traditional (T) and experiential (E)), and 
correlations among learning attitudes such as motivation, peer 
relationship and positive attitude. We found a positive correlation 
between social learning and peer relationship (p<0.005), but 
negative between social learning and negative attitude (p<0.05) in 
E. Social learning was positively correlated with positive attitude 
(p<0.001) in T. 


Keywords 
Text mining, clustering algorithm, sentiment analysis, motivation, 
engagement 


1. INTRODUCTION 


Studies have shown that attitudes are related to motivation,
engagement and outcome in learning. When learners have a positive
attitude, they spend more time engaging in learning [5, 9].
A difference among students with positive attitude and motivation in
e-learning settings has been observed [6]. Bored students have
poorer learning outcomes than frustrated students [1]. Hence,
sentiment analysis could be used to harness learning attitudes.


Recently, machine learning methods in natural language
processing have become prevalent, and many training
datasets exist for supervised learning algorithms. However, the task of
opinion mining without such a dataset can be a challenge. We
combined a symbolic technique for unsupervised machine
learning with a clustering algorithm to discover emerging patterns
among texts written in Thai that could reflect students’ learning
attitudes. Our findings demonstrate how such an approach could be
useful in exploring and understanding relationships among
learning attitudes.


2. METHODS 
2.1. Data Acquisition 


Our subjects were 83 freshman undergraduate students (M:F =
62:21; average age = 17.2) in Robotics and Automation
Engineering at King Mongkut’s University of Technology
Thonburi. They consented to participate in the study.
This data set was collected while students were taking the same
classes. Students wrote in Thai about what they learned each week
for all 14 weeks.


Warasinee Chaisangmongkon
Institute of Field Robotics
King Mongkut’s University of
Technology, Thonburi¹
(+662) 470-9716
warasinee.cha@kmutt.ac.th


Chanikarn Wongviriyawong
Institute of Field Robotics
King Mongkut’s University of
Technology, Thonburi¹
(+662) 470-9717
chanikarn@fibo.kmutt.ac.th


2.2. Data Analyses 

We used an open source Lexitron dictionary (NECTEC, 2006) as 
word database in Thai and an open source algorithm Lexto 
(NECTEC, 1994) to tokenize texts into longest words possible. 
We had 383 entries. On average, each entry had 124.3 words. 


Word frequency was calculated for each student as the ratio of the
number of times each unique word appeared in any learning
journal to the total number of words. Irrelevant words
(prepositions, conjunctions, and generic verbs and nouns) and
words that appeared fewer than 20 times across all entries were filtered
out. Negation and irrealis phenomena, out-of-topic sentences, and
irony and sarcasm were not treated in our analysis. We ran
several clustering algorithms on the distance matrix with various
initial conditions and different numbers of clusters (2, 3, or 4) to
determine if any pattern of word clusters could emerge.


Among frequently used words, instructors chose words that
represented these six attitudes: 1) positive relationship with others
(Peer relationship), 2) desire to improve oneself (Motivation), 3)
positive emotion (Positive attitude), 4) negative emotion
(Negative attitude), 5) engagement in learning on one’s own
(Solitary learning), and 6) engagement in learning that involves
others (Social learning). The associated words were also evaluated
by another group of students to indicate the level of congruity with
each attitude². The results are shown in Table 1. We calculated a
student’s attitude score as the sum of the percentage word
frequencies of the words associated with each of the 6 attitudes.
The Pearson correlation coefficient and the p-value of the correlation were
computed between every pair of attitudes. Correlation analyses were
performed independently for each cluster.
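The scoring and correlation steps can be sketched as follows. This Python illustration assumes that each student's journal text has already been tokenized into a list of words; the function names are hypothetical. It computes the attitude score, the Pearson coefficient, and the t statistic (with n - 2 degrees of freedom) from which a p-value would be obtained; the p-value lookup itself needs a statistics library and is left out.

```python
import math


def attitude_score(tokens, attitude_words):
    """Sum of percentage word frequencies of the words tied to one attitude."""
    hits = sum(1 for t in tokens if t in attitude_words)
    return 100.0 * hits / len(tokens)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def t_statistic(r, n):
    """t value for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))
```

Running `pearson_r` on the per-student scores of two attitudes, separately for the students in each cluster, corresponds to the per-cluster correlation analysis described above.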


3. RESULTS AND DISCUSSION 


We found that 2 clusters emerged, yielding the most consistent set
of words. The first cluster contained words such as take exams,
read books, problem sets, formula, lessons, math, writing,
calculus, physics, language, etc. The second contained words such
as human being, people, work, see, team, fun, talk, play, like,
group, together, etc. The first cluster was labelled T for traditional
and the second, E for experiential. Although initial conditions and
clustering algorithms were varied, these two clusters emerged.


¹ King Mongkut’s University of Technology, Thonburi’s address: 126 Pracha Uthit Rd, Bang Mot, Thung Khru, Bangkok 10140 Thailand


² The data were collected from 28 native Thai speakers (average age = 20.18). They were asked to rate how each pair of words and an
attitude was meaningfully or semantically related (e.g. Peer Relationship vs. Group) in a 5-point Likert scale.


Table 1. Words associated with the 6 attitudes and their ratings²
(mean score, with standard deviation in parentheses)


Attitude            Associated words                                                         Rating
Peer Relationship   group, talk, help, together, team, help each other, we, everyone, etc.   3.87 (0.48)
Motivation          improve, practice, better, goals, development, improvement, etc.         3.76 (0.4)
Positive Attitude   fun, enjoy, like, happy, funny, good, excited, etc.                      3.64 (0.37)
Negative Attitude   stressed, confused, sleepy, slow, difficult, do not understand, etc.     3.01 (0.45)
Solitary Learning   exams, formula, scores, books, grades, study, responsibility, etc.       3.34 (0.77)
Social Learning     hands on, experiment, project, communication, participate, etc.          3.71 (0.31)


Motivation was positively correlated with solitary learning (R=0.4
(T) and 0.55 (E); p<0.05). For T, this could mean that when one
desires to improve oneself, one engages in learning even on one’s
own. Our result supports a previous finding that motivation and
engagement are correlated [3, 10]. For E, the correlation might
arise because when one enjoys learning with others, one’s motivation
increases. Previous studies showed that people who reported
feeling happy engaged in social activities more often and
that sociability was a strong predictor of life satisfaction [2, 7].


Additionally, for E, motivation was positively correlated with 
social learning (R=0.42, p<0.05); social learning was positively 
correlated with peer relationship (R=0.6, p<0.005), but negatively 
correlated with negative attitude (R=-0.44, p<0.05). For T, social 
learning was positively correlated with positive attitude (R=0.55, 
p<0.001). Relationships with peers are very important in helping 
learners become adaptive in different learning environments [8]. 
Previous studies showed that students with positive peer
relationships were more likely to be engaged in academic tasks and
to perform better in school than students without positive peer
relationships [11, 12, 13]. Our finding supports existing literature
showing that learning abilities are related to learners’ attitudes [5].


However, our approach has some limitations. Our algorithm is
simple frequency counting. Because less frequently used
words were filtered out, we expect our results to
remain robust under different weighting methods. Moreover,
sarcasm, negation, and irrealis phenomena were not considered, which
might have a slight effect on our results.


Future work involves testing the robustness of our approach with
more data. To explore additional emergent patterns, we could also apply
adjustments to various clustering algorithms [4]. We are
developing a platform to help teachers quantify students’ attitudes.
ABSTRACT


In computer-based tutoring systems, it is important to assess 
students’ mastery of different skills and provide remediation. In 
this study, we propose a novel neural network approach to 
estimate students’ skill mastery patterns. We conducted a 
simulation to evaluate the proposed neural network approach and 
we compared the neural network approach with one of the most 
widely used cognitive diagnostic algorithms, the DINA model, in
terms of skill estimation accuracy and the ability to recover skill 
prerequisite relations. Results suggest that, while the neural 
network method is comparable in skill estimation accuracy to the 
DINA model, the former can recover skill prerequisite relations 
more accurately than the DINA model. 


Keywords 


prerequisite discovery, skills, neural network, student modeling, 
cognitive diagnosis model 


1. INTRODUCTION 


In intelligent tutoring systems, assessing students’ skill mastery 
patterns and determining skill prerequisite relationships are two
important areas of research. Various approaches have been proposed to
solve these two problems, including Educational Data Mining 
(EDM) approaches, such as Bayesian Knowledge Tracing, 
Learning and Performance Factor Analysis (for a comparison see 
[5]), and psychometric approaches, such as Cognitive Diagnostic 
Models (CDMs) [2, 6]. Compared to CDMs, which assess student 
skill mastery based on their responses to a test administered at one 
time point (i.e., no learning occurs during the test), the EDM 
approaches have the advantage of assessing student learning 
dynamically. However, unlike CDMs, which estimate every 
item’s psychometric properties, the EDM approaches often 
assume all test items that measure the same set of skills have the 
same psychometric properties (e.g., same guessing and slipping 
parameters). This assumption is unlikely to be tenable in practice, 
and it may lead to less accurate skill estimation and less efficient 
item selection. While both approaches have their strengths and 
weaknesses, this study focuses on developing a new CDM
approach using neural networks and evaluates the proposed
approach by comparing it with the current most popular CDM 
method, the DINA (deterministic inputs, noisy “and” gate) model 
[2] using simulated data. 


2. A BRIEF INTRODUCTION TO NEURAL 
NETWORKS 


A neural network is a supervised classification algorithm that 
consists of several layers of neurons (i.e., processing units) [4].
Each neuron linearly combines information from previous layers 
and applies a non-linear activation function. The most commonly- 
used activation function is the logistic/sigmoid function. A typical 
feedforward neural network consists of a layer of hidden units and 
a layer of output units. Mathematically, it can be represented as: 


\[
Y_{n,q} = \mathrm{sigmoid}\left(1_{n,1} b_{1,q} + \mathrm{sigmoid}\left(1_{n,1} b_{1,k} + X_{n,p} W_{p,k}\right) W_{k,q}\right),
\]


where $Y_{n,q}$ is the output matrix consisting of n subjects’ values on
q output variables, $X_{n,p}$ is the input matrix consisting of n
subjects’ values on p input variables, $b_{1,k}$ is a vector of intercept
values for k hidden units, $W_{p,k}$ is the weight matrix between p
input variables and k hidden units, $b_{1,q}$ is a vector of intercept
values for q output units, and $W_{k,q}$ is the weight matrix between k
hidden units and q output units.


One challenge in applying neural networks to estimate students’ 
skill mastery patterns is that students’ skill mastery patterns are 
unobserved. Thus, we only have observed values for the input 
variables (students’ item response patterns) but not for the output 
variables (students’ skill mastery pattern). 


3. METHODOLOGY: THE PROPOSED 
NEURAL NETWORK APPROACH 


To overcome the problem mentioned above, we propose a novel 
neural network model that has the same input and output (i.e.,
students’ item response patterns). The core idea underlying our 
approach is to first reduce the input (student item response 
patterns) to a smaller number of hidden units representing 
students’ latent skills and then use these hidden units to best 
reproduce student item response vectors (i.e., output) with the 
restriction of the Q-matrix, a matrix that specifies the set of skills 
measured by each item. A conceptual diagram of the proposed
network is shown in Figure 1.


Figure 1. A diagram of the proposed neural network.
Relations between skill hidden units and output units are
specified based on the Q-matrix. (Diagram labels: k hidden units;
q hidden units, which correspond to skills; p output units.)


It is important to note that the relation between the second layer of 
hidden units and output units is specified based on the Q-matrix, 
which specifies which skills are required by each item. Intuitively, 
the network first extracts features from student item response 
patterns and then it dictates the relations between features and 
student item response patterns based on the Q-matrix. 
Mathematically, the model can be represented as follows: 


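\[
\hat{X}_{n,p} = \mathrm{sigmoid}\Big(1_{n,1} b_{1,p} + \mathrm{sigmoid}\big(1_{n,1} b_{1,q} + \mathrm{sigmoid}(1_{n,1} b_{1,k} + X_{n,p} W_{p,k})\, W_{k,q}\big) (W_{q,p} \odot Q_{q,p})\Big),
\]


where $\hat{X}_{n,p}$ denotes the reproduced item response matrix, $\odot$
represents elementwise multiplication, and $Q_{q,p}$ is the
Q-matrix.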


Similar to a regular neural network, the proposed model uses 
maximum likelihood to define the cost function and it can be 
optimized using some variants of gradient descent (e.g., rprop 
[4]). To speed up the optimization, it is important to choose 
meaningful starting values for the weight matrices. To initialize 
$W_{q,p}$, we can first train a multivariate logistic regression with all
the theoretically possible skill patterns as input and their
corresponding expected item response patterns (i.e., the item
response patterns obtained when slipping and guessing
parameters are 0) as output. Then, we use the weight matrix from this
multivariate logistic regression as the starting values in the
proposed neural network.
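To make the architecture concrete, the forward pass can be sketched in dependency-free Python. This is only an illustration under stated assumptions: the layer sizes and weights are invented, the output intercept `b_p` and all function names are hypothetical, and training (the likelihood-based cost and rprop updates) is omitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine_sigmoid(X, W, b):
    """Return sigmoid(1 b + X W) for matrices stored as lists of rows."""
    return [
        [sigmoid(b[j] + sum(x[i] * W[i][j] for i in range(len(x))))
         for j in range(len(b))]
        for x in X
    ]


def mask(W, Q):
    """Elementwise product of W and Q: drops skill-to-item weights not in Q."""
    return [[w * q for w, q in zip(w_row, q_row)] for w_row, q_row in zip(W, Q)]


def forward(X, W_pk, b_k, W_kq, b_q, W_qp, b_p, Q_qp):
    hidden = affine_sigmoid(X, W_pk, b_k)                   # n x k features
    skills = affine_sigmoid(hidden, W_kq, b_q)              # n x q skill units
    recon = affine_sigmoid(skills, mask(W_qp, Q_qp), b_p)   # n x p outputs
    return skills, recon


# Toy sizes: p = 3 items, k = 2 hidden units, q = 2 skills (values invented).
X = [[1, 0, 1], [0, 0, 1]]
W_pk = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
b_k = [0.0, 0.1]
W_kq = [[1.0, -1.0], [0.5, 0.5]]
b_q = [0.0, 0.0]
W_qp = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
b_p = [-1.0, -1.0, -1.0]
Q_qp = [[1, 0, 1], [0, 1, 1]]   # rows = skills, columns = items

skills, recon = forward(X, W_pk, b_k, W_kq, b_q, W_qp, b_p, Q_qp)
```

Because of the mask, the reconstructed response to the first item depends only on the first skill unit, which is what ties the second hidden layer to the skills named in the Q-matrix.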


4. EVALUATION 


In order to demonstrate the accuracy of the proposed neural
network, we conducted a preliminary simulation study. Five
thousand students’ responses (correct/incorrect) to 28 test
items were generated based on the skill prerequisite model
shown in Figure 2 and a Q-matrix (available upon request). The
guessing and slipping parameters for all items were set to 0.1. We
compared the proposed method with the DINA model in terms of
accuracy of 1) student skill pattern estimates and 2) skill
prerequisite relation recovery. Accuracy of skill pattern estimates
is defined as:


\[
\text{accuracy} = 1 - \frac{\lvert \text{estimated skill pattern matrix} - \text{true skill pattern matrix} \rvert}{n \cdot q},
\]


where n is the sample size, and q is the number of skills in the
Q-matrix. The skill prerequisite relations were recovered by using a
Bayesian network to model the relations among estimated student
skills. The causal direction in the Bayesian network is determined
by the following heuristic [1]:


If P(skill1=0) < P(skill2=0), then skill1 is the prerequisite of skill2.
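A small sketch of the accuracy measure and the direction heuristic follows. It assumes that accuracy is one minus the fraction of mismatched cells between the estimated and true binary skill matrices, and it only orients a link between two given skills; learning which links exist (the Bayesian network step) is not shown. The function names and toy data are hypothetical.

```python
def skill_accuracy(estimated, true):
    """1 - (sum of absolute elementwise differences) / (n * q)."""
    n, q = len(true), len(true[0])
    mismatches = sum(
        abs(e - t)
        for e_row, t_row in zip(estimated, true)
        for e, t in zip(e_row, t_row)
    )
    return 1 - mismatches / (n * q)


def prerequisite_direction(patterns, a, b):
    """Orient the link between skills a and b by their non-mastery rates.

    Returns (prerequisite, dependent), or None when the rates tie.
    """
    n = len(patterns)
    p_a0 = sum(1 for row in patterns if row[a] == 0) / n
    p_b0 = sum(1 for row in patterns if row[b] == 0) / n
    if p_a0 < p_b0:
        return (a, b)
    if p_b0 < p_a0:
        return (b, a)
    return None


# Toy data: 4 students, 2 skills (invented for illustration).
true = [[1, 1], [1, 0], [0, 0], [1, 0]]
estimated = [[1, 1], [1, 1], [0, 0], [1, 0]]
acc = skill_accuracy(estimated, true)           # one of 8 cells differs
direction = prerequisite_direction(true, 0, 1)  # skill 0 is mastered more often
```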




Figure 2. Skill prerequisite relations, true model used in the 
simulation (left); recovered using DINA skill estimates 
(middle) and neural network skill estimates (right) 


To evaluate the recovered prerequisite relationship, we counted 
the number of estimated causal links that were not in the true 
model, and the number of missing causal links that were in the 
true model. 
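The two counts can be computed with plain set differences, as in the sketch below. The arc lists are hypothetical and do not reproduce the simulation's true model.

```python
def compare_arcs(true_arcs, estimated_arcs):
    """Return (extra, missing) directed links relative to the true model."""
    true_set, est_set = set(true_arcs), set(estimated_arcs)
    extra = est_set - true_set      # estimated links not in the true model
    missing = true_set - est_set    # true links that were not recovered
    return extra, missing


# Hypothetical arcs, written as (prerequisite, dependent) pairs.
true_arcs = [("S1", "S2"), ("S1", "S3"), ("S3", "S4")]
estimated_arcs = [("S1", "S2"), ("S1", "S3"), ("S1", "S4")]
extra, missing = compare_arcs(true_arcs, estimated_arcs)
```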


We programmed our proposed neural network using Python. The
number of hidden units in the first layer was set to 56. The 
number of hidden units in the second layer was set to seven, 
corresponding to seven skills in the Q-matrix. The Rprop 
algorithm was used to optimize the neural network. For the DINA 
analysis, we used the CDM R package [6]. For the Bayesian 
network analysis, we used the bnlearn R package’s mmhc 
algorithm [7] and Rgraphviz R package [3]. 


The results suggested that the proposed method had similar or 
slightly better accuracy (89.2%) at estimating skill patterns than 
the DINA model (87.9%). Moreover, the proposed method was 
better at recovering the skill prerequisite relations. The recovered 
skill prerequisite relations by the DINA model and the proposed 
method are shown in Figure 2. The prerequisite relations recovered
based on the DINA skill estimates only contained two arcs from
the true model (i.e., S1 to S2, S1 to S3), and they contained two
arcs that were not in the true model (S1 to S4, S2 to S3). The
prerequisite relations recovered based on the neural network skill 
estimates contained all the arcs from the original model, as well as 
two arcs that were not in the true model (S1 to S4, S2 to S3). 
Overall, the results suggested that the proposed network had 
slightly better skill estimation accuracy than the DINA model and 
it was more accurate at recovering skill prerequisite relations than 
the DINA model. 


5. CONCLUSIONS AND DISCUSSION 


This study proposed a novel neural network approach to estimate 
student skill mastery patterns in CDM. Traditionally, parameter 
estimation of models with latent variables usually depends on 
Expectation Maximization or Markov Chain Monte Carlo 
methods. The proposed neural network approach frames the latent 
variable model problem as a supervised problem and it solves it 
using the gradient descent method. Initial evidence suggests that 
the proposed method has skill estimation accuracy comparable to that of
the DINA model, but it can recover skill prerequisite relations
better than the DINA model. Further research is needed to 
rigorously evaluate this method. 
ABSTRACT


The number of students that can be helped in a given class 
period is limited by the time constraints of the class and 
the number of agents available for providing help. We use 
a classroom-replay of previously collected data to evaluate 
a data-driven method for increasing the number of students 
that can be helped. We use a machine learning model to 
identify students who need help in real-time, and an inter- 
action network to group students who need similar help to- 
gether using approach maps. By assigning these groups of 
struggling students to peer tutors (as well as the instructor),
we were able to more than double the number of students 
helped. 


Keywords 
Introductory Programming; Learning Analytics; Machine 
Learning; Peer Tutors; Educational Data Mining 


1. INTRODUCTION 


While a typical classroom may be full of students experi- 
encing the same problem and students who have solved that 
problem, this expertise is rarely utilized. Instead, often the 
only source of help is the instructor, who is most likely un- 
able to help all the students who need help within the time 
constraints of the class period. To address this problem,
we propose and evaluate several methods for improving the 
efficiency of student assistance using machine learning. 


Diana et al. [1] showed that low-level log data from the Al- 
ice introductory programming environment can be used to 
accurately predict student grades, and that they could in- 
crease the number of students helped by matching struggling 
students to a peer tutor based on the similarity of their code. 
A subsequent study [2] found that the accuracy and interpretability
of the previously reported predictive model could
be improved by increasing the grain size of the features from 
a vocabulary of terms derived through natural language pro- 
cessing (NLP) to small snippets of code. We explore how 
this improvement impacts peer tutor matching and the ef- 
ficiency of providing help more generally. Additionally, we 
use an interaction network graph to test if students who may 
benefit from the same kind of help can be grouped together, 
increasing the efficiency of the instructor or peer tutor. 


2. METHODS 


The data used in the current study were originally collected 
by Werner et al. [3] as part of a two-year project exploring
the impact of game design and programming on the development
of computer science skills. The students were
asked to complete an assessment task called the Fairy As- 
sessment. The current experiment closely follows the data 
transformation methodology reported in [1] to convert raw 
log data into program representations called code-states and 
the code-state complexity reduction methodology reported 
in [2] to reduce code-states to smaller code-chunks.


We used ridge regression to predict students’ grades. We 
compared two methods for generating the features inputted 
into the regression. In the first method, features were a vo- 
cabulary of NLP terms generated from the students’ code- 
states. In the second method, each code-state was first con- 
verted into a list of code-chunks, and then into a chunk- 
frequency vector. A chunk-frequency vector is a vector whose 
length is equal to the total number of features being consid- 
ered in the model. Each value in the vector corresponds to 
the frequency of the respective code-chunk. 
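As a minimal sketch of the second feature representation, a chunk-frequency vector can be built by counting how often each feature chunk occurs in a code-state's list of code-chunks. The chunk names below are invented placeholders, not the features used in the study.

```python
from collections import Counter


def chunk_frequency_vector(code_chunks, feature_chunks):
    """Count occurrences of each feature chunk, in feature order."""
    counts = Counter(code_chunks)
    return [counts[chunk] for chunk in feature_chunks]


# Hypothetical feature chunks and one code-state's chunk list.
feature_chunks = ["move", "turn", "doTogether", "if"]
code_chunks = ["move", "turn", "move", "say"]
vector = chunk_frequency_vector(code_chunks, feature_chunks)
```

Chunks that are not among the model's features (here `"say"`) simply do not contribute to the vector.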


The predicted grades were also used to estimate which stu- 
dents need help and which students may be able to provide 
help. We call the students classified as needing help using 
their actual grades low-performing students. This classifica- 
tion serves as the ground-truth that we use to evaluate our 
predictive model. In a real world implementation, we would 
not have access to the actual grades, so we must estimate 
them and use those estimates to classify students as needing
help. If a student’s predicted grade was in the bottom
quartile, and they have not been helped or are not currently 
being helped (“helped” status persists across time), then that
student was added to the group of students who still need 
to be helped, which we call the Help Pool. If a student’s 
predicted grade was in the top quartile, and they are not 
currently helping a student, then that student was added to 
the group of students who may be able to help other stu- 
dents, which we call the Tutor Pool. For each student in 
the Help Pool, we first checked to see if the instructor was 
available to help. If so, the instructor was assigned to that 
student. If the instructor was unavailable (i.e., helping an- 
other student), then we searched for a peer tutor. We used 
a network graph of each code-state (or code-chunk frequen- 
cies) for each user to match tutees to tutors. We searched 
for tutors who shared a common ancestor node (i.e., shared 
a previous program state) with the tutee. These tutors were 
added to a pool of potential tutors. From that pool we se- 
lected the tutor with the common ancestor node that was 
closest (i.e., least number of steps away) to the tutee’s cur- 
rent node. The same method applied if segmenting was used, 
except that instead of matching the instructor or peer tutor 
to one student, the instructor or tutor was matched to a 
segment of students with a similar problem. 
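The pooling and matching procedure can be sketched as below. This is a simplified illustration rather than the authors' implementation: it assumes each student's history is a list of program states, treats any state that both students visited as a shared ancestor, uses `statistics.quantiles` for the quartile cutoffs, and leaves out segmenting. All names and data are hypothetical.

```python
from statistics import quantiles


def build_pools(predicted, helped, helping):
    """Split students into a Help Pool and a Tutor Pool by predicted grade."""
    q1, _, q3 = quantiles(predicted.values(), n=4)
    help_pool = [s for s, g in predicted.items() if g <= q1 and s not in helped]
    tutor_pool = [s for s, g in predicted.items() if g >= q3 and s not in helping]
    return help_pool, tutor_pool


def steps_back_to_shared_state(tutee_path, tutor_path):
    """Steps from the tutee's current state back to the nearest shared state."""
    visited = set(tutor_path)
    for steps, state in enumerate(reversed(tutee_path)):
        if state in visited:
            return steps
    return None  # no common ancestor


def assign_help(help_pool, tutor_pool, paths, instructor_free=True):
    """Assign the instructor first, then the closest available peer tutor."""
    assignments = {}
    free_tutors = list(tutor_pool)
    for tutee in help_pool:
        if instructor_free:
            assignments[tutee] = "instructor"
            instructor_free = False
            continue
        candidates = []
        for tutor in free_tutors:
            steps = steps_back_to_shared_state(paths[tutee], paths[tutor])
            if steps is not None:
                candidates.append((steps, tutor))
        if candidates:
            _, tutor = min(candidates)
            assignments[tutee] = tutor
            free_tutors.remove(tutor)
    return assignments


# Toy class of eight students with invented predicted grades and state paths.
predicted = {f"s{i}": 0.1 + 0.1 * i for i in range(1, 9)}
paths = {
    "s1": ["A", "B"],
    "s2": ["A", "B", "D"],
    "s7": ["A", "C", "E"],
    "s8": ["A", "B", "F"],
}
help_pool, tutor_pool = build_pools(predicted, helped=set(), helping=set())
assignments = assign_help(help_pool, tutor_pool, paths)
```

In this toy run the instructor takes the first student in the Help Pool, and the second is matched to the tutor whose path shares the most recent state with theirs.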


2.1 Efficiency Index 


While the primary goal of our previous work [1] was to eval- 
uate how well our model could correctly classify students 
who would go on to have a low final grade (low-performing 
students), the primary goal of the current experiment is to 
evaluate how efficient this intervention would be. That is, we 
were interested in what percentage of those low-performing 
students could be helped, and how we can maximize that 
percentage. We call this ratio the Efficiency Index (EI), 
and define it formally as: 


\[
EI = \frac{\text{LowPerformingStudentsHelped/BeingHelped}}{\text{LowPerformingStudents}} \tag{1}
\]


The EI can be further broken down into the percentage of
low-performing students helped by the instructor ($EI_I$) and
the percentage of low-performing students helped by peer
tutors ($EI_{PT}$).
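A short sketch of the Efficiency Index computation is given below. It assumes that each low-performing student is counted at most once, so that the instructor and peer-tutor components sum to the overall EI; the function name and the toy student IDs are hypothetical.

```python
def efficiency_index(low_performing, helped_by_instructor, helped_by_peers):
    """Return (EI, EI_I, EI_PT) as fractions of the low-performing students."""
    low = set(low_performing)
    by_instructor = low & set(helped_by_instructor)
    # Count a student helped by both sources only once (assumption).
    by_peers = (low & set(helped_by_peers)) - by_instructor
    ei_i = len(by_instructor) / len(low)
    ei_pt = len(by_peers) / len(low)
    return ei_i + ei_pt, ei_i, ei_pt


# Toy example: 8 low-performing students; "x" was helped but is not low-performing.
low_performing = ["a", "b", "c", "d", "e", "f", "g", "h"]
ei, ei_i, ei_pt = efficiency_index(
    low_performing,
    helped_by_instructor=["a", "b"],
    helped_by_peers=["c", "d", "e", "f", "x"],
)
```

Students who received help without actually being low-performing do not raise the index, since the ground truth comes from actual grades.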


3. RESULTS 


We compared models using a linear mixed model with the 
measure of interest as the dependent variable, model as a 
fixed effect, and time bin as a random effect. 


We hypothesized that we can use low-level programming 
data to group similar low-performing students together so 
that they can be helped as a group. To test this, we first 
replicated our previously reported model to use as a baseline 
measure. Then, we generated a new model that incorporated
segmenting. Both models used NLP features in a ridge re- 
gression and an interaction network graph built using code- 
states as nodes. We found that the EI (M=0.467, SD=0.210) 
of the model that incorporated segmenting was significantly 


We also hypothesized that using the presence or absence of 
code-chunks as model features would improve the perfor- 
mance of the model. To test this, we generated a model 
using a sample of the code-chunks from our previous work 
that were shown to be good predictors of learning outcomes 
[2]. We generated a model using these 16 code-chunk fea- 
tures (rather than the NLP-derived terms used in the base- 
line model), and found that this code-chunk model had a 
significantly lower (p<.001) RMSE (M=0.246, SD=0.064) 
than the baseline model (M=0.263, SD=0.073). 


Finally, we hypothesized that a network graph generated 
using code-chunks as nodes would lead to greater coverage 
and a higher EI. To test this, we generated a model using 
the same 16 code-chunks described above as features in the 
regression. A network graph was also generated to incorpo- 
rate segmenting. However, instead of each node correspond- 
ing to a code-state, each node corresponded to a chunk- 
frequency vector. Representing nodes as chunk-frequency 
vectors more than doubled the coverage (coverage=0.924) 
compared to the network graph generated using code-states 
(coverage=0.374). The model using chunk-frequency
vectors to generate the network graph (M=0.813, SD=0.128)
also had a significantly higher (p<.001) EI than the model
using code-states (M=0.428, SD=0.217).


4. CONCLUSIONS 


In this paper, we explored a method for increasing the amount 
of help given in a typical class period. Our previous work 
demonstrated that we can use a predictive model to accu- 
rately identify students who may need help. We built off of 
this work in two ways. First, we improved the accuracy of 
the predictive model by using more relevant features. Sec- 
ond, we drastically increased the number of students able 
to be helped from, on average, 3.72 to 9.92 by grouping 
low-performing students together to be helped as a group 
(in combination with better model features). These results 
suggest that using low-level log data to group and match 
low-performing students to peer tutors may be an effective 
way to increase the amount of help given in a classroom. 
ABSTRACT


Educational technology commonly leverages multiple-choice ques- 
tions for student practice, but short-answer questions hold the po- 
tential to provide better learning outcomes. Unfortunately, students 
in online settings often exhibit little effort when crafting
short-answer responses, instead producing invalid responses that
are off-topic and do not relate to the question being
answered. In this study, we consider the effect of entering on-topic
short-answer responses on student learning and retention. To do
this, we first develop a machine learning method to automatically 
label student open-form responses as either valid or invalid using a 
small amount of hand-labeled training data. Then, using data from 
several high school AP Biology and Physics classes, we present 
evidence that providing valid short-answer responses creates a pos- 
itive educational benefit on later practice. 


Keywords 
Best educational practices, Cognitive psychology, Machine learn- 
ing, Natural language processing, Mixed effect modeling 


1. INTRODUCTION


An important part of the learning process is recalling learned in- 
formation from memory [3]. In most educational situations, this 
practice is accomplished by asking students practice questions related
to the learning material. In online learning, multiple-choice
questions are by far the most common, followed by short-answer
questions. While multiple-choice questions are attractive due to the
ease of machine scoring, it is worth asking whether they are the best
option for improving learning. Indeed, multiple-choice questions are
oft-criticized because they are perceived to require only shallow 
recognition processes to complete [7]. Short-answer responses, by 
contrast, are generally believed to have a stronger learning bene- 
fit to students as they afford more difficult reconstructive cognitive 
processes. 


Prior experiments examining the relative benefits of multiple-choice 
and short-answer have been mixed, with short-answer questions 
generally found to improve learning only when subsequent feed- 
back is provided [2, 4]. One factor that has not been examined 
in prior research, however, is how the quality of short-answer
responses provided by students contributes to learning. In online
educational settings where students lack oversight, students do not
always take the time to craft thoughtful short-answer responses.
Instead, they often opt to quickly enter an off-topic response to
advance their progress or view feedback. 


We hypothesize that students derive greater learning benefits when 
they produce valid short-answer responses than when they do not, 
even when those valid responses are incorrect. While it is possi- 
ble to hand-label student responses as valid or invalid for a small 
number, it is not feasible to do this at large scale. To circumvent 
this scalability issue, we devise a machine-learning based classi- 
fier trained on a small number of hand-labeled exemplars. We then 
leverage this classifier to analyze the impact of entering valid re- 
sponses on learning. 


2. AUTOMATIC VALIDITY 
CLASSIFICATION 


Due to the large number of words in student responses, our method 
for automatically classifying student short-answer responses as valid 
or invalid begins with parsing to reduce the overall size of the fea- 
ture space. First, we attempt simple spelling correction for each 
word of a student’s response. Following spelling correction, we
strip common stopwords (e.g., of, as, is, etc.) and replace any
nonsensical words (e.g., random keyboard presses) with a specially
defined tag, which has the effect of mapping all unknown words to
the same label. Next, we stem the acceptable words in a student’s
response to further reduce the dimensionality of our feature space.
Finally, we convert the parsed student response to a numerical
feature vector using a bag-of-words model.
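The parsing steps above can be sketched roughly as follows. Everything here is illustrative: the stopword list, vocabulary, and tag name are invented, the suffix stripper is a crude stand-in for a real stemmer, and spelling correction is omitted.

```python
from collections import Counter

STOPWORDS = {"of", "as", "is", "the", "a", "to"}     # illustrative subset
VOCABULARY = {"cell", "cells", "divide", "dividing", "membrane", "energy"}
UNKNOWN_TAG = "<unk>"                                 # hypothetical tag name


def naive_stem(word):
    """Very rough suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def parse_response(text):
    """Drop stopwords, tag unknown words, and stem the remaining words."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    parsed = []
    for tok in tokens:
        if not tok or tok in STOPWORDS:
            continue
        parsed.append(naive_stem(tok) if tok in VOCABULARY else UNKNOWN_TAG)
    return parsed


def bag_of_words(parsed, feature_names):
    """Count each feature term in the parsed response."""
    counts = Counter(parsed)
    return [counts[name] for name in feature_names]


features = ["cell", "divid", "membran", "energy", UNKNOWN_TAG]
on_topic = bag_of_words(parse_response("The cells divide."), features)
off_topic = bag_of_words(parse_response("asdf jkl;"), features)
```

Mapping all nonsense tokens to one tag means that keyboard mashing of any form lands on the same feature, which a downstream classifier can then use.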


Following parsing, we employ a random forest [1] to classify each 
student response as either valid or invalid. We measured the per- 
formance of our method using 5-fold cross-validation on 20,000 
hand-labeled responses and found our accuracy to be 95%. 


3. ANALYSIS OF VALID RESPONSES ON 
LEARNING 


We now turn our attention to evaluating the impact of providing 
valid short-answer responses on future learning outcomes using 
real-world educational data. 


Our dataset is taken from a pilot study of our online learning
platform, OpenStax Tutor [6], which was conducted during the
2015–2016 academic year. OpenStax Tutor has two important features
relevant to our discussion. First, it uses a hybrid answering
format [7] that first requires students to enter a short-answer response
to the question and then requires the student to select the correct
answer from a multiple-choice list. Second, OpenStax Tutor employs
a concept known as spaced practice, which automatically assigns
questions to students on material that they have learned in previous
assignments. The purpose of this feature is to ultimately improve
long-term knowledge retention, but we leverage these spaced prac- 
tice observations as an opportunity to observe the effects of entering 
valid short-answer responses on later practice. 


The pilot consisted of two separate high school courses, AP Biol- 
ogy and standard (non-AP) Physics. A total of 207 students (74 AP 
Biology, 154 Physics) and 8 instructors (4 AP Biology, 4 Physics) 
participated in the pilot. There are roughly 100,000 short-answer 
responses on initial practice problems, and 20,000 of these answers 
were hand-labeled by subject matter experts as being valid or in- 
valid responses to the given question. The average spaced practice 
problem occurs roughly 3 weeks after the initial practice on the 
topic is complete. 


To analyze the impact of entering valid open-form responses we 
adopt a mixed effect logistic regression model [5]. Our binary out- 
come is whether or not the student answered the spaced practice 
question for a given topic correctly. Our random effects (R) are 
nuisance quantities for student ability, topic difficulty, and instruc- 
tor quality. We examine two different fixed effects in our model: M, 
the number of multiple-choice questions that a student answered 
correctly on a given topic and V, the number of valid short-answer 
responses that a student provided on a given topic. 


We consider four separate models for student success on spaced 
practice questions. Each model includes the random-effects R. We 
then separately consider the effects of the fixed effects M and V as 
well as considering both fixed effects jointly. We fit all four models 
to the AP Biology and Physics datasets separately. The results for 
AP Biology and Physics are shown in Table 1 and Table 2,
respectively. In order to determine which model provided the best fit, we
used the Akaike information criterion (AIC), which imposes
a penalty on models with too many parameters to prevent
overfitting. Models with lower AIC values are deemed better than 
models with higher AIC values. 


For AP Biology, we found that the R+V model achieved the lowest
AIC, implying that the number of valid responses was a better
predictor of success than the number of correct multiple-choice
selections. The coefficient for the number of valid responses is
positive and statistically significant, which matches our hypothesis that
more valid responses improve student retention. For Physics, we
note that R+M+V provides the lowest AIC value and is
significantly better than R+M alone. This implies that both
factors together produce a better model fit.


Table 1: Summary of AP Biology Data Models


Dependent variable: Correct on Spaced Practice

                      (R)          (R+M)        (R+V)        (R+M+V)
Number Core Correct                0.030*                    -0.009
                                   (0.016)                   (0.027)
Number Core Valid                               0.034**      0.040*
                                                (0.013)      (0.023)
Constant              0.613***     0.467***     0.427***     0.437***
                      (0.075)      (0.107)      (0.105)      (0.109)
Observations          1,987        1,987        1,987        1,987
Log Likelihood        -1,278.010   -1,276.102   -1,274.653   -1,274.599
Akaike Inf. Crit.     2,562.019    2,560.203    2,557.305    2,559.199
Note: *p<0.1; **p<0.05; ***p<0.01


Table 2: Summary of Physics Data Models


Dependent variable: Correct on Spaced Practice

                      (R)          (R+M)        (R+V)        (R+M+V)
Number Core Correct                0.082***                  0.076***
                                   (0.013)                   (0.013)
Number Core Valid                               0.097***     0.078***
                                                (0.023)      (0.022)
Constant              0.002        -0.316***    -0.105       -0.377***
                      (0.074)      (0.087)      (0.079)      (0.089)
Observations          4,000        4,000        4,000        4,000
Log Likelihood        -2,703.761   -2,682.312   -2,693.697   -2,675.836
Akaike Inf. Crit.     5,413.522    5,372.623    5,395.394    5,361.672
Note: *p<0.1; **p<0.05; ***p<0.01
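The model comparison can be checked with the standard definition AIC = 2k - 2 log L, using the log-likelihoods reported in Tables 1 and 2. The parameter counts k below are not stated in the paper; they are an assumption, chosen because they reproduce the reported AIC values to within rounding.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def best_model(models):
    """Return the name of the model with the lowest AIC."""
    return min(models, key=lambda name: aic(*models[name]))


# (log-likelihood from the tables, assumed number of parameters k)
biology = {
    "R": (-1278.010, 3),
    "R+M": (-1276.102, 4),
    "R+V": (-1274.653, 4),
    "R+M+V": (-1274.599, 5),
}
physics = {
    "R": (-2703.761, 3),
    "R+M": (-2682.312, 4),
    "R+V": (-2693.697, 4),
    "R+M+V": (-2675.836, 5),
}
best_biology = best_model(biology)
best_physics = best_model(physics)
```

For AP Biology, adding M to the R+V model raises the log-likelihood by only about 0.05, which does not offset the penalty of 2 for the extra parameter.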


4. CONCLUSIONS 


We have developed a machine-learning based method for classify- 
ing student open-form responses to questions as being either valid 
(on-topic) or invalid (off-topic) using a combination of intelligent 
parsing and supervised classification. We have further presented 
evidence that students who spend time crafting thoughtful responses 
show improved learning outcomes when practicing earlier material. 


The results derived in this work come from
searching for patterns in existing data and relied on students
deciding of their own volition whether or not to enter a valid
short-answer response. Future research in this area will involve a more
highly controlled study in which the opportunity to enter a
short-answer response will be controlled by our learning system. This
will allow us greater control over our experimental setup and aid in 
the interpretation of our final result. 
ABSTRACT 


After developing an intelligent tutoring system (ITS), or any other class of 
learning environment, one of the first questions that should be asked is 
whether the system was effective in helping students learn the targeted 
skills or subject matter. In this study, we employed two educational data 
mining models, the Additive Factor Model (AFM) and Performance Factor 
Analysis (PFA), both available in DataShop (LearnSphere), to assess 
adults’ learning gains on 5 theoretical levels. For the KC models tested, 
AFM showed positive learning gains for the Rhetorical Structure knowledge 
component; in contrast, under the PFA model, adults did not learn from 
either successes or failures. 


Keywords 


Learning gains, Theoretical Levels, Additive Factor Model, 
Performance Factor Analysis, CSAL AutoTutor 


1. INTRODUCTION 


One of the first questions that is asked after developing an 
intelligent tutoring system (ITS) is whether the system was 
effective in helping students learn the targeted skills or subject 
matter. Learning gains are based on the performance of the 
students as they work on the system over time with many 
opportunities for learning. These learning gains can be assessed at 
a fine-grained level by tracking the learning of specific knowledge 
components (KCs), which are particular skills, strategies, 
concepts, or facts, as articulated in the Knowledge-Learning- 
Instruction (KLI) framework [2]. In this paper, we analyze the 
learning of the theoretical components (KCs) which were based 
on models of comprehension that adopt a multilevel framework in 
our dialogue-based intelligent tutoring system, called CSAL 
AutoTutor, that was designed to help struggling adult readers 
learn reading comprehension strategies. The Graesser and 
McNamara framework identifies 5 levels [1]: words (W), syntax 
(S), the explicit textbase (TB), the referential situation model 
(SM), and the discourse genre and rhetorical structure (RS, the 
type of discourse and its composition). The computational models 
used in the analysis were the Additive Factor Model (AFM) and 
Performance Factor Analysis (PFA), both of which are available in 
DataShop (LearnSphere) [3]. Three questions are addressed in this 
paper: (1) When training the adults to read, did their performance 
follow the levels of text difficulty? (2) Did adults’ learning gains 
increase after using AutoTutor, which provided only some 
instruction on reading comprehension strategies and some 
practice? (3) Did adults learn from successes or failures? 
2. METHODOLOGY 


The participants were 52 adult readers in Atlanta and Toronto who 
took part in a 100-hour intervention study conducted by the CSAL 
team; they completed up to 30 lessons during the intervention. 
Each lesson had between 10 and 30 multiple-choice questions to 
assess their performance. When they answered a question 
incorrectly, they were given a hint and a second chance to select 
among the two remaining options. However, in this analysis we 
considered only performance on the first attempt, not the 
follow-up. 


The original measures in the AFM model included performance, 
practice opportunities (the number of questions answered in 
a lesson), the knowledge components (the KCs were the 5 theoretical 
components), and subject (participant). For model fitting, pre-test 
scores and text difficulty (easy, medium, and hard) were added 
to the original models (Table 1). Ultimately, we ran 10 models 
(5 AFM models and 5 PFA models) for the KC approaches and 
determined which AFM and PFA models performed best 
based on AIC, BIC, and log-likelihood. 
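
For orientation, the two model families are usually written in the literature in the forms below. This is a sketch of the standard textbook forms only: the exact parameterization fitted in DataShop, and the way the pre-test and text-difficulty terms enter, may differ, and all symbols here are illustrative assumptions rather than the authors' notation.

```latex
% AFM: student i, item j, knowledge component k.
% T_{ik} = number of prior practice opportunities of student i on KC k.
\operatorname{logit} P(Y_{ij} = 1) = \theta_i + \sum_{k} q_{jk}\,(\beta_k + \gamma_k T_{ik})

% PFA: practice is split into prior successes s_{ik} and prior failures f_{ik}.
\operatorname{logit} P(Y_{ij} = 1) = \sum_{k} q_{jk}\,(\beta_k + \gamma_k s_{ik} + \rho_k f_{ik})
```

Here $\theta_i$ is the proficiency of student $i$, $\beta_k$ the easiness of KC $k$, $q_{jk}$ indicates whether item $j$ exercises KC $k$, and $\gamma_k$ and $\rho_k$ are learning-rate parameters. On this reading, a positive practice-opportunity coefficient for a KC indicates learning on that KC, and PFA additionally separates learning from successes from learning from failures.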


Table 1. Models Construction by Adding New Variables 


Model 2 | Pre-test score, Text Difficulty 
Model 3 | Pre-test score, Text Difficulty: KC Model 
Model 4 | Pre-test score, Practice Opportunity: KC Model 
Model 5 | Pre-test score, Text Difficulty: Practice Opportunity: KC Model 


* These models are basically logit mixed-effect models. The “:” refers to an 
interactive effect. 


3. RESULTS AND DISCUSSION 


Analyses of the 10 models consistently showed that Model 3 was 
the best model, yielding the lowest AIC, BIC, and log-likelihood 
scores. 


Both Table 2 (AFM results) and Table 3 (PFA results) confirm 
the expectation that pre-test score is a strong predictor of 
adults’ performance. Also, only for Rhetorical Structure did 
performance decrease as a function of text difficulty. This is 
consistent with Graesser and McNamara’s multilevel 
theoretical framework, which distinguishes the deeper discourse 
levels of processing (such as the Situation Model and Rhetorical 
Structure) from the basic reading levels (such as Words and 
Syntax) [1]. As shown in Table 2, only for Rhetorical Structure did 
performance improve significantly as practice opportunities 
increased; for the other KCs the practice-opportunity estimates 
were negative (Syntax, Situational Model, Textbase) or not 
significant (Word). As shown in Table 3, cumulative correctness 
had significant interactions with Syntax and Situational Model, 
and cumulative incorrectness had significant interactions with 
Syntax and Textbase, but the estimates of these interactions were 
all negative, which indicates that performance got worse 
regardless of whether adults experienced more successes or 
failures on these KCs. For the other KCs, the coefficients were 
close to 0. 


Table 2. AFM Output of Model 3 - Theoretical Levels 


                 Estimate   SE     Z Score       P-value       Sig.
Intercept         0.675     0.25    2.66         [illegible]
Pre-test Score    0.140     0.03    4.97         0.00          ***
PO: RS            0.001     0.00   [illegible]   0.02          *
PO: S            -0.124     0.02   -5.16         0.00          ***
PO: SM           -0.003     0.00   -3.69         0.00          ***
PO: TB           -0.016     0.00   -4.98         0.00          ***
PO: W            -0.004     0.00   -0.95         0.34
RS: Hard         -1.805     0.19   -9.73         0.00          ***
S: Hard           0.822     0.28    2.94         0.00          **
SM: Hard         -0.111     0.18   -0.62         0.54
TB: Hard          0.014     0.19    0.07         0.94
W: Hard          -0.204     0.30   -0.69         0.49
RS: Medium       -1.241     0.18   -7.07         0.00          ***
S: Medium        -0.078     0.26   -0.30         0.77
SM: Medium       -0.035     0.18   -0.20         0.84
TB: Medium        0.133     0.19    0.71         0.48
W: Medium         0.529     0.29    1.84         0.07


*PO refers to practice opportunity. RS refers to Rhetorical Structure. S 
refers to Syntax. SM refers to Situational Model. TB refers to Textbase. 
W refers to Word. Easy, Medium, Hard are three levels of text difficulty. 
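
Because these are logit models, each estimate is a change in log-odds. The sketch below shows two conversions that help when reading the table: an estimate to an odds ratio, and a Z score to a two-sided p-value under a standard-normal assumption. The function names are hypothetical, and the normal approximation is an assumption about how the reported p-values were obtained.

```python
import math


def odds_ratio(estimate: float) -> float:
    """Multiplicative change in the odds of a correct answer for a
    one-unit increase in the predictor (exp of the logit coefficient)."""
    return math.exp(estimate)


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z statistic under the standard normal,
    using the identity 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# "RS: Hard" row of Table 2: estimate -1.805.
print(round(odds_ratio(-1.805), 3))
# "PO: W" row of Table 2: Z score -0.95 (reported p-value 0.34).
print(round(two_sided_p(-0.95), 2))
```

Read this way, the RS: Hard estimate corresponds to the odds of a correct answer on hard texts being roughly one sixth of the odds on the reference difficulty level for Rhetorical Structure items.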


Table 3. PFA Output of Model 3 - Theoretical Levels 


                 Estimate   SE     Z Score       P-value       Sig.
Intercept         0.671     0.26    2.60         0.01          **
Pretest           0.145     0.03    4.87         0.00          ***
CC: RS            0.000     0.00   -0.12         0.91
CC: S            -0.127     0.04   -3.47         0.00          ***
CC: SM           -0.005     0.00   -2.32         0.02          *
CC: TB           -0.008     0.01   -1.30         0.19
CC: W            -0.004     0.01   -0.69         0.49
CI: RS            0.005     0.00    1.37         0.17
CI: S            -0.123     0.04   -3.14         0.00          **
CI: SM            0.001     0.00    0.41         0.68
CI: TB           -0.031     0.01   -2.77         0.01          **
CI: W            -0.002     0.02   -0.13         0.90
RS: Hard         -1.808     0.19   -9.74         0.00          ***
S: Hard           0.828     0.37   [illegible]   0.03          *
SM: Hard         -0.099     0.18   -0.55         0.58
TB: Hard         -0.069     0.20   -0.35         0.73
W: Hard          -0.209     0.30   -0.69         0.49
RS: Medium       -1.248     0.18   -7.10         0.00          ***
S: Medium        -0.079     0.27   -0.29         0.77
SM: Medium       -0.023     0.18   -0.13         0.90
TB: Medium        0.068     0.19    0.35         0.72
W: Medium         0.524     0.30   [illegible]   0.08


*CC and CI refer to cumulative correctness and cumulative 
incorrectness. Others are the same as Table 2. 


4. CONCLUSIONS 


The model comparison revealed that practice opportunity, adults’ 
prior literacy skills, the KC model (theoretical levels), and text 
difficulty were factors influencing adults’ performance. From the 
interactions between theoretical levels and text difficulty, we can 
conclude that adults’ performance on Rhetorical 
Structure and Situational Model matched the difficulty levels of 
the texts used in the lessons of the two KCs; that is, they did better 
on easy texts and worse on medium and hard texts. For the 
basic reading levels (Word, Syntax, and Textbase), the pattern was 
different. According to the AFM results, learning 
gains on the deeper discourse level of processing (Rhetorical 
Structure) increased, because adults’ performance improved 
as they received more practice opportunities. No 
learning gains were observed on the other KCs (Situational Model, 
Syntax, Textbase, and Word). From the PFA results, we did not 
observe significant learning gains from either successes or 
failures. 
ABSTRACT 


Blended courses have become the norm in post-secondary 
education. Universities use large-scale learning management 
systems to manage class content. Instructors deliver read- 
ings, lectures, and office hours online; students use intelli- 
gent tutors, web forums, and online submission systems; and 
classes communicate via web forums. These online tools al- 
low students to form new social networks or bring social 
relationships online. They also allow us to collect data on 
students’ social relationships. In this paper we report on 
our research on community formation in blended courses 
based on online forum interactions. We found that it was 
possible to group students into communities using standard 
community detection algorithms via their posts and reply 
structure and that the students’ grades are significantly cor- 
related with their closest peers. 


Keywords 
Educational Data Mining, Graph data mining, Social Net- 
works, Blended Courses 


1. INTRODUCTION 


Improvements in technology have facilitated new models of 
student and instructor engagement. Students now supple- 
ment the traditional course structure with online materials. 
Instructors can share class material online, have an online 
discussion forum, or make quizzes and homework submis- 
sions online. This in turn provides a wealth of new data on 
student behaviors that we can use to study students’ social 
relationships. In particular it allows us to study the impact 
of these social ties on course outcomes. 


In prior work, Brown et al. showed that students in MOOCs 
form pedagogically relevant and homogeneous social networks: 
students can be clustered into stable communities based upon 
their pattern of online questions and replies [1], and 
students’ final grades are significantly correlated with those 
of their closest peers and community group. They have also 
shown that these communities, while homogeneous in terms 
of performance, are not united by their incoming motivations 
for enrolling in the course or by their prior experience 
level [2]. 


To date these results have only been found in MOOCs, where 
the user forum represents students’ primary connection to 
one another and almost all relevant course interactions occur 
online. Students in blended courses, by contrast, often 
have preexisting social ties that carry over from prior courses 
at the same institution. In this paper we show that while 
forum interactions are not the only means of communication 
between students, they still define the same kind of communities 
as were found in MOOCs, and that the students’ final grades 
are significantly correlated with those of their community 
members. 


2. DATASET INFORMATION 

In this paper we report on studies of three distinct courses, 
“Discrete Math-2013”, “Discrete Math-2015” and “Java Pro- 
gramming Concepts-2015”. All three are undergraduate com- 
puter science courses, offered at NC State and include sig- 
nificant blended components. Discrete Math-2015 and Java 
Programming Concepts-2015 occurred contemporaneously 
during the Fall 2015 semester while Discrete Math-2013, a 
previous offering of Discrete Math-2015, was offered in Fall 
2013. 


3. METHODS 


3.1 Defining Social Interactions 

Each node in our social networks represents an individual 
participant in the class. In the first class, anonymous posting 
was allowed, so a single unknown user is associated with all 
the anonymous posts. Social relationships are represented 
as arcs. We define a social relationship based upon direct 
and indirect replies in the user forum, following a method 
similar to that of Brown et al. [2]. We defined an edge between 
A and B if B replied to a thread after A had done so. This 
interaction can include starting the original thread, replying 
with a follow-up, or posting feedback on a reply. We then 
aggregate these edges to form a weighted graph containing 
arcs for all of the relations. We assume that anyone who 
posts on a thread has read the prior comments before doing 
so. Thus each arc defines a form of social interaction between the 
participants, as the students are expressly choosing to make a 
public reply to one another. For the purposes of the present 
analysis we included only students in our network and thus 
confined our social relationships to between-student connections. 


Figure 1: Communities generated on Discrete Math 2013 class 
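
The edge definition in Section 3.1 can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `build_reply_graph`, the input layout (one chronologically ordered list of author ids per thread), and the choice to add a weight of 1 per later post are all assumptions made for the example.

```python
from collections import Counter


def build_reply_graph(threads, students):
    """Return a Counter mapping (earlier_poster, later_poster) -> weight.

    threads:  list of threads, each a chronologically ordered list of
              author ids (assumed input format).
    students: set of author ids to keep; other posters (for example
              instructors) are ignored, as in the between-student network.
    """
    weights = Counter()
    for posts in threads:
        earlier = []  # distinct student authors seen so far in this thread
        for author in posts:
            if author not in students:
                continue
            # B posting after A implies an arc from A to B.
            for prev in earlier:
                if prev != author:
                    weights[(prev, author)] += 1
            if author not in earlier:
                earlier.append(author)
    return weights


threads = [["a", "b", "a"], ["b", "c", "ta"], ["a", "b"]]
graph = build_reply_graph(threads, students={"a", "b", "c"})
print(dict(graph))
```

Aggregating over threads in a `Counter` yields the weighted graph directly; isolated students simply never appear as an endpoint of any arc.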


3.2 Graph Analysis 


For each of the graphs we generated, we removed the isolated 
vertices and performed clustering using the method 
described in [1, 2]. Our clustering method is an iterative 
process in which we evaluate the modularity of partitions with 
an increasing number of clusters until we find a limit point 
where the modularity almost stops growing, which indicates 
the natural cluster number k. We generated the clusters via the 
Girvan-Newman edge-centrality algorithm [3]: on each iteration 
the algorithm removes the most central edge, and it 
repeats until a set of k disjoint clusters has been produced. 
We then assessed whether or not the grade distributions 
in different clusters are significantly different by calculating 
the Kruskal-Wallis (KW) correlation between cluster assignment 
and grade. Kruskal-Wallis is a nonparametric analogue 
to the more common ANOVA test [4]. 
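
The stopping rule above depends on the modularity of each candidate partition. The sketch below computes Newman's modularity Q for an undirected, unweighted graph; the paper's graphs are weighted, so this is a simplified illustration under that assumption, and the function name `modularity` is chosen for the example.

```python
def modularity(edges, communities):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges:       iterable of (u, v) pairs, each undirected edge listed once.
    communities: list of disjoint sets of nodes covering every endpoint.
    Q = sum over communities c of  L_c / m - (d_c / (2 m)) ** 2,
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the summed degree of the nodes in c.
    """
    edges = list(edges)
    m = len(edges)
    label = {node: idx for idx, group in enumerate(communities) for node in group}
    internal = [0] * len(communities)
    degree = [0] * len(communities)
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(internal, degree))


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(modularity(edges, [{0, 1, 2}, {3, 4, 5}]))        # two communities
print(modularity(edges, [{0, 1, 2, 3, 4, 5}]))          # one community
```

Evaluating Q for k = 1, 2, 3, ... and stopping where it plateaus gives the kind of "natural cluster number" described above.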


4. RESULTS 


In the graph generated for Discrete Math 2013, we found that 
the natural cluster number is 42. We 
performed the Girvan-Newman clustering, and the resulting 
clusters can be seen in Figure 1. In this figure, each node 
represents a community, the size of the node shows the 
number of members, and the color shows their average grade. 
The KW correlation between cluster 
number and grades is statistically significant (p = 0.044 
< 0.05), which is similar to the results in MOOCs. 


Our results show that for the Discrete Math 2015 (p = 0.004 < 
0.05) and Java Programming Concepts 2015 (p = 0.015 < 
0.05) graphs, there is a similarly significant KW correlation 
between student grades and their communities. 
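
The significance values above come from the Kruskal-Wallis test, whose H statistic is computed from the ranks of the pooled grades. The sketch below is a plain-Python illustration with hypothetical names; it uses average ranks for ties but omits the tie-correction factor, so it is not a substitute for a statistics library.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of samples (one per cluster).

    Ties receive the average of the ranks they span; the usual
    tie-correction divisor is omitted for brevity.
    """
    pooled = sorted((value, g) for g, sample in enumerate(groups) for value in sample)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for _, g in pooled[i:j]:
            rank_sums[g] += avg_rank
        i = j
    total = sum(r * r / len(sample) for r, sample in zip(rank_sums, groups))
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)


# Toy grades for three clusters (invented for illustration only).
print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Under the null hypothesis, H is approximately chi-square distributed with (number of clusters - 1) degrees of freedom, which is how a p-value is obtained; in practice a library routine such as `scipy.stats.kruskal` would normally be used.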


5. DISCUSSION, CONCLUSIONS AND FUTURE WORK 


In this paper, we generated a social graph between students 
in three different blended courses based on forum interactions. 
We found that, as in MOOCs, communities are 
formed in these graphs whose members tend to have similar 
grades. This is consistent with prior work, which indicates 
that student communities on forums may be used to predict 
course outcomes [1, 2]. 


Having access to these social graphs can help instructors 
to identify the communities formed among students which 
can be used to find the students who need more help earlier. 
Our research does not show causality. Thus more research is 
needed to find out whether being in the communities makes 
their grades similar, or students are just likely to interact 
with others who are more like them. If we find out that the 
community membership has an effect on students’ perfor- 
mance, we can use this information to identify isolated or 
poorly-performing groups early in the course and intervene 
by encouraging them to make contact with better students 
or seek help as a group. 


Much work has been done on how forum interactions in MOOCs, 
being a hub in a social network, or being at the center of 
the graph could affect students’ performance. 
We can use these graphs to investigate which 
interaction levels lead to better grades. 


In further work we plan to address whether or not we can 
identify other types of social ties in blended courses, since 
the communications are more complicated. 
ABSTRACT 


In this paper, we propose an automatic evaluation method 
for descriptive-type tests. The method is based on recurrent 
neural networks trained on a non-labeled language 
corpus and manually graded students’ answers. The experimental 
results show that the proposed method achieves the 
second-best result among five methods, including 
BLEU, RIBES, and several sentence-embedding methods, 
and the best performance among the sentence-embedding 
methods. 


Keywords 
RNN, LSTM, Language Model, Essay Scoring 


1. INTRODUCTION 


Twenty-first-century skills are advocated in the educational 
field. Compared to traditional knowledge-based education 
evaluated by multiple-choice tests, the evaluation of twenty- 
first-century skills is very difficult. A descriptive test is one 
solution to the problem, although the cost of scoring is pro- 
hibitive. In this paper, we propose a method to automati- 
cally score descriptive type tests to solve the problem stated 
above. The method uses long short-term memory (LSTM) 
recurrent neural networks (RNN) to score the answers writ- 
ten in natural language. The method requires two kinds of 
data sets. 
One is a large language corpus used for pre-training of RNN. 
As pre-training, the RNN-based language model is trained 
using the corpus. A vector given by a hidden layer in the 
networks is thought to embed the meaning of processed sen- 
tences. Thus, the proposed method calculates the similar- 
ity between two vectors given by processing model answers 
and student answers on RNN. The other data set is a small 
labeled corpus that consists of model answers, student an- 
swers, and manually annotated scores of student answers. 
The labeled corpus is used for training of the RNN. 


2. PROPOSED METHOD 


The RNN framework used in the paper is shown in Fig. 1. 
As shown in the figure, the proposed method uses two kinds 
of corpora and two kinds of training parts. They are the 
pre-training of word embedding and the main training of 
the LSTM-type RNN [3]. 

Here, we express the sentence ($s$) as the sequence of words 
$s = w_1, \dots, w_t, \dots, w_T$. The word-embedding part projects 
the input word at time $t$ ($w_t$) to a high-dimensional vector 
$x_{w_t} \in \mathbb{R}^{d_w}$ as follows: 


$$x_{w_t} = E^{\top} \mathbf{w}_{w_t}, \qquad (1)$$


where $\mathbf{w}_{w_t} \in \mathbb{R}^{|V|}$ is the one-hot vector of $w_t$ and 
$E \in \mathbb{R}^{|V| \times d_w}$ is the lookup table. $x_{w_t}$ is used as the input for 
the LSTM part. The LSTM consists of four components: 
the forget gate ($f_t$), input gate ($i_t$), output gate ($o_t$), 
and the memory state ($c_t$). These real-valued vectors are 
calculated by the following formulas: 


$$
\begin{aligned}
f_t &= \sigma(W_f x_{w_t} + U_f h_{t-1} + b_f), \\
i_t &= \sigma(W_i x_{w_t} + U_i h_{t-1} + b_i), \\
o_t &= \sigma(W_o x_{w_t} + U_o h_{t-1} + b_o), \\
\tilde{c}_t &= \tanh(W_c x_{w_t} + U_c h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad (2)
\end{aligned}
$$


where $W$ and $U$ are weight matrices, and $b$ is the bias vector. 
$\sigma(\cdot)$ and $\tanh(\cdot)$ are an element-wise sigmoid function 
and a hyperbolic tangent function, respectively. Using these 
vectors, the hidden-layer vector ($h_t \in \mathbb{R}^{d_s}$) is calculated as follows: 


$$h_t = o_t \odot \tanh(c_t), \qquad (3)$$


where $\odot$ is element-wise multiplication. 


Figure 1: Framework of the proposed method. 


The main training 
part requires a labeled corpus that consists of model answers, 
the students’ answers, and manually scored results 
of the students’ answers. By using the labeled corpus, the 
second training part tunes the LSTM, whose network configuration 
was proposed by Mueller et al. [1]. Using the pre-trained 
word-embedding matrix $E$ from the first training 
part, the LSTM parameters are trained as follows. 


First, randomly initialize the LSTM parameters in Eq. 2. Then 
duplicate the initialized LSTM ($\mathrm{LSTM}_a$ and $\mathrm{LSTM}_b$ in Fig. 
1). One of them is used to process the student’s answer 
and the other is used to process the model answer. We regard 
the hidden-layer vector at the end of the sentence as the sentence 
embedding. To calculate the sentence similarity between 
the student’s answer and the model answer, we add a new 
unit between the hidden layers. The unit calculates a similarity 
based on the L1 norm of the difference between the two sentence 
embeddings ($h_{T_a}$ and $h_{T_b}$ in Fig. 1) by using the following 
formula [1]: 


$$
g(h_{T_a}, h_{T_b}) = \exp\left(-\lVert h_{T_a} - h_{T_b} \rVert_1\right)
= \exp\left(-\sum_{l=1}^{d_s} \left| h_{T_a,l} - h_{T_b,l} \right|\right). \qquad (4)
$$


The similarity calculation is performed only once both sentences 
of a pair have been processed by the LSTM. Using the 
similarity calculated by Eq. 4 and the manually evaluated 
score, the deviation is back-propagated to tune the LSTM 
weights. Here, we restrict the parameters of $\mathrm{LSTM}_a$ and 
$\mathrm{LSTM}_b$ to the same values. 
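
To make the forward pass concrete, the sketch below implements one LSTM step (Eqs. 2 and 3) and the similarity unit (Eq. 4) in plain Python with list-based vectors. It is an illustration only: the function names, the `params` layout, and the zero initial states are assumptions for this example, and training by back-propagation is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Compute W x + U h + b for list-of-lists matrices W and U."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def lstm_step(params, x, h_prev, c_prev):
    """One step of Eqs. 2-3; params maps 'f', 'i', 'o', 'c' to (W, U, b)."""
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]
    c_tilde = [math.tanh(z) for z in affine(*params["c"], x, h_prev)]
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def encode(params, embedded_sentence, d_s):
    """Run the LSTM over a sequence of word vectors and return the final
    hidden vector, used here as the sentence embedding."""
    h, c = [0.0] * d_s, [0.0] * d_s
    for x in embedded_sentence:
        h, c = lstm_step(params, x, h, c)
    return h


def similarity(h_a, h_b):
    """Eq. 4: exp of the negative L1 distance, a value in (0, 1]."""
    return math.exp(-sum(abs(a - b) for a, b in zip(h_a, h_b)))
```

Passing the same `params` object to both `encode` calls, one for the student answer and one for the model answer, mirrors the constraint that the two LSTMs share their parameter values.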


3. EXPERIMENTS 


The labeled corpus consists of 10 descriptive type questions 
and their answers. For each question, around 20 answers 
are manually scored. Additionally, there are also four model 
answers for each question. For the pre-training of the word- 
embedding matrix, we use a Mainichi newspaper corpus. 


Since the size of the labeled corpus is very small, we carry 
out a leave-one-out cross-validation test for each question. 
The cross-validation is carried out only for student answers; 
the same model answers are used for training and evaluation. 
The LSTM in this paper can only process a pair of 
one student answer and one model answer at a time. 
Thus, all combinations of student answers and model answers 
in the training set are used for training. For the scoring 
of the test set, we average the scores obtained with the 
several model answers. The evaluation measure is the correlation 
coefficient between the manual and the automatic scoring 
results. 


Figure 2: Experimental results. 
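
The evaluation described above has two small computational pieces: averaging a student answer's score over the model answers, and correlating automatic with manual scores. The sketch below illustrates both; the function names are hypothetical, and the Pearson coefficient is an assumption, since the text says only "correlation coefficient".

```python
import math


def averaged_score(similarities):
    """Average the similarity of one student answer to each model answer."""
    return sum(similarities) / len(similarities)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy values invented for illustration: similarities of three student
# answers to four model answers, and the matching manual scores.
automatic = [averaged_score(s) for s in ([0.2, 0.3, 0.1, 0.2],
                                         [0.5, 0.6, 0.4, 0.5],
                                         [0.9, 0.8, 0.7, 0.8])]
manual = [1.0, 3.0, 5.0]
print(pearson(automatic, manual))
```

In a leave-one-out setting, each automatic score would come from a model trained on all other student answers for that question.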


Fig. 2 shows the experimental results. As baseline results, 
we show the results of BLEU, RIBES, and the Doc2Vec 
(D2V) cosine similarity method with the newspaper (NP) 
corpus and the Wikipedia (Wiki) corpus, taken from the 
conventional research [2]. As shown in the figure, the proposed 
method never gives a negative correlation coefficient, whereas 
the conventional sentence-embedding-based methods 
do give negative correlation coefficients. Additionally, the 
proposed method gives the best results on average among the 
sentence-embedding methods, which are the two kinds of D2V and 
the proposed method. Compared to all methods, the proposed 
method offers the second-best performance. 


4. CONCLUSIONS AND FUTURE WORKS 


We proposed the LSTM-based automatic scoring method for 
descriptive tests. We carried out experiments using actual 
learning logs. According to the experimental results, the 
proposed method gives the best performance among several 
sentence-embedding methods, and the second-best results 
among five methods including BLEU and RIBES. 
ABSTRACT 


We describe a graph-based modelling approach to explor- 
ing interactions associated with a change in students’ affec- 
tive state when they are working with an exploratory learn- 
ing environment (ELE). Student-system interactions data 
collected during a user study was modelled, visualized and 
queried as a graph. Our findings provide new insights into 
how students are interacting with the ELE and the effects 
of the system’s interventions on students’ affective states. 


1. INTRODUCTION 


Much recent research has focussed on Exploratory Learn- 
ing Environments (ELEs) which encourage students’ open- 
ended interaction with a knowledge domain, combined with 
intelligent components that aim to provide pedagogical sup- 
port to ensure students’ productive interaction. The aim of 
this feedback is to balance students’ freedom to explore alter- 
native task solution approaches while at the same time pro- 
viding sufficient support to ensure that the intended learn- 
ing goals are being achieved [6]. Here we report on recent 
work into identifying interaction events that are associated 
with a change in students’ affective state as they interact 
with an affect-aware ELE called Fractions Lab. We adopt 
a graph-based approach to modelling, querying and visual- 
izing the student-system interactions data, extending pre- 
liminary work in this area reported in [8]. In our graphs, 
nodes represent occurrences of key indicators that are de- 
tected, inferred or generated by the ELE, and edges be- 
tween such nodes represent the “next event” relationship. In 
contrast, recent work on interaction networks and hint gen- 
eration (e.g. [4]) uses graphs whose nodes represent states 
within a problem-solving space and edges represent students’ 
actions in transitioning between states. That work uses the 
graph-modelled data to automatically generate feedback for 
the student, whereas we use a graph-based modelling ap- 
proach to investigate the effects of the system’s interventions 
in order to better understand how students interact with the 
ELE with the aim of improving its support for students. 


2. THE ELE AND USER STUDY 

Fractions Lab is an ELE that is part of the iTalk2Learn 
learning platform targeted at children aged 8-12 years who 
are learning about fractions. As students interact with Frac- 
tions Lab they are asked to talk aloud about their reasoning 
process. This speech, together with their interactions, is 
used to detect students’ affective states using a combination 
of Bayesian and rule-based reasoning [5]. Adaptive support 
is provided based on the student’s performance and detected 
affective state. The affective states detected by Fractions 
Lab can be ranked according to their effect on learning, 
based on previous studies (e.g. [7, 3, 1]). For example, being 
in flow is a positive affective state as it indicates that the 
student is engaging with the learning task well. Confusion is 
mostly associated with realising misconceptions, which also 
contributes towards learning, while frustration and boredom 
are likely to have a negative effect on learning. 


We conducted a user study in which iTalk2Learn was used 
by students in a classroom setting. Forty-one students aged 8-10, 
recruited from two schools in the UK, took part with parental 
consent. Students were given a short introduction to the 
system. They then engaged with the Fractions Lab ELE for 
40 minutes. They then completed an online questionnaire 
that assessed their knowledge of fractions (the post-test). 


The iTalk2Learn platform logged every student-system in- 
teraction, such as fractions being created or changed by stu- 
dents, buttons being clicked, feedback being provided by 
the system, feedback being viewed by students, and the 
system’s detection of students’ affective states. This data 
was then remodelled into a graph form, according to the 
graph data model shown in Figure 1. We see that the data 
model comprises two node types: Event nodes, that cap- 
ture occurrences of key interactions, and EventType nodes, 
that hold additional metadata about each event. Edges la- 
belled NEXT link together successive Event nodes, allowing 
us to build up a sequence of events that describe the his- 
tory of student-system interactions as a student works on a 
task during a session. An edge labelled OCCURRENCE_OF 
links each Event node to an EventType node. 


Figure 1: Graph data model for student-system interaction data. 


The data logged by iTalk2Learn was exported as text, parsed 
and pre-processed using Python and the Pandas and py2neo 
libraries, and then loaded into the Neo4j graph database. To 
view the resulting data graph we developed a custom visualization 
tool in JavaScript using the Node.js library. Our 
tool allows viewing of large-scale changes in affective state 
as well as details of event sequences. Having interacted with 
these visualizations, we were interested to explore further 
the kinds of events that contribute towards changes in students’ 
affective state as they work with Fractions Lab. To 
do this, we used Neo4j’s graph query language, Cypher, to 
extract the metadata relating to pairs of consecutive events 
that exhibit a change in a student’s affective state. The 
query below was used to find adjacent Event nodes connected 
by NEXT, and the EventType nodes they are connected 
to by OCCURRENCE_OF, such that the affective 
states associated with the EventType nodes are not equal: 


MATCH (start_event:Event)-[:OCCURRENCE_OF]->(start_type:EventType),
      (end_event:Event)-[:OCCURRENCE_OF]->(end_type:EventType),
      p = (start_event)-[:NEXT]->(end_event)
WHERE start_type.affective_state IN
      ["flow", "boredom", "confusion", "frustration"]
  AND end_type.affective_state IN
      ["flow", "boredom", "confusion", "frustration"]
  AND NOT start_type.affective_state = end_type.affective_state
RETURN *
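
For readers less familiar with Cypher, the same selection can be sketched over an in-memory event list. This is an illustrative equivalent, not the code used in the study: the function name, the tuple layout, and the event-type names in the example are invented, and list adjacency stands in for the NEXT edge.

```python
STATES = {"flow", "boredom", "confusion", "frustration"}


def affect_transitions(events):
    """Return consecutive event pairs that exhibit an affective-state change.

    events: chronologically ordered list of (event_type, affective_state)
            tuples for one session; affective_state may be None when the
            event type carries no affective state.
    A pair is kept when both states are in STATES and they differ,
    mirroring the WHERE clause of the query.
    """
    pairs = []
    for start, end in zip(events, events[1:]):
        if start[1] in STATES and end[1] in STATES and start[1] != end[1]:
            pairs.append((start, end))
    return pairs


session = [
    ("task_start", None),
    ("reflective_prompt", "flow"),
    ("fraction_changed", "frustration"),
    ("affect_boost", "frustration"),
    ("feedback_viewed", "flow"),
]
for start, end in affect_transitions(session):
    print(start, "->", end)
```

The graph database version has the advantage that the same pattern can be matched across all sessions at once and extended with further metadata on the EventType nodes.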


3. RESULTS AND CONCLUSIONS 


We were interested to explore differences in students’ affective 
states and interactions in relation to their performance. 
Students’ performance, based on the post-test score, 
was on average 3.83 (SD=1.46; min=0; max=6). A median 
split of students’ scores resulted in a higher- and a 
lower-performing group (high: 27 students; low: 14 students). In 
order to investigate which interactions moved students into 
a different affective state we used association rule learning 
(cf. [2]) over the data returned by the above Cypher 
query. We found that students are likely to move from flow 
to frustration when provided with reflective prompts in the 
low-performing group and with open-ended problem solving 
support in the high-performing group. This might imply 
that these types of support are imposing too high a cognitive 
demand on students. Additionally, certain interactions 
with their fractions may move both categories of student 
from flow to frustration. Viewing high-interruption or 
low-interruption feedback may move low or high performing 
students, respectively, from flow to confusion. Finally, we 
observed a positive effect of Affect Boost messages for both 
categories of student. 
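
Association rule learning ranks candidate rules by measures such as support and confidence. The sketch below computes both for a single rule of the form (triggering event, start state) -> end state. The record layout, the function name, and the toy data are assumptions for illustration and do not come from the study.

```python
def rule_metrics(transitions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    transitions: list of (trigger_event, start_state, end_state) records.
    antecedent:  (trigger_event, start_state) pair.
    consequent:  end_state.
    support    = share of all records matching antecedent and consequent.
    confidence = share of antecedent matches that also match the consequent.
    """
    n = len(transitions)
    antecedent_count = sum(1 for t, s, _ in transitions if (t, s) == antecedent)
    both_count = sum(
        1 for t, s, e in transitions if (t, s) == antecedent and e == consequent
    )
    support = both_count / n if n else 0.0
    confidence = both_count / antecedent_count if antecedent_count else 0.0
    return support, confidence


# Invented toy records, not data from the user study.
records = [
    ("reflective_prompt", "flow", "frustration"),
    ("reflective_prompt", "flow", "frustration"),
    ("reflective_prompt", "flow", "confusion"),
    ("affect_boost", "frustration", "flow"),
]
print(rule_metrics(records, ("reflective_prompt", "flow"), "frustration"))
```

Computing the metrics separately for the higher- and lower-performing groups is one way to surface group-specific rules of the kind reported above.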


These findings extend earlier ones reported in [5] with a 
finer-grained analysis of students’ affective state changes, 
identifying several situations where the system’s support 
may need to be modified: (i) reviewing the content of both 
the high- and the low-interruption messages, to see if the 
incidences of confusion can be reduced; (ii) considering ex- 
tending the provision of reflective prompts and open-ended 
support with additional affect boost messages and hints that 
students might also select to view, to mitigate against frus- 
tration; (iii) considering providing more scaffolds when stu- 
dents are manipulating their fractions, for example addi- 
tional low-interruption feedback. Exploratory learning en- 
vironments such as Fractions Lab can generate large volumes 
of student-system interactions data, making their interpre- 
tation a challenging task. We have seen here how modelling 
such data as a graph can open up new data visualization, 
querying and analysis opportunities, leading to new insights 
into how students are interacting with the ELE and the ef- 
fects of the system’s interventions, with the ultimate goal of 
designing improved support for students. 
ABSTRACT 


In this paper, we describe how we build accurate predictive
models of students’ performance in a SPOC (small private online
course). We document a performance prediction methodology that
runs from raw logging data on the OpenEdX platform to model
analysis. We attempted to predict students’ performance in the
Computer Structure Lab Course (Fall 2016) offered at Beihang
University. We extracted 28 predictive features for 377 students, and
our model achieved an AUC (area under the ROC curve) in the range of
0.62-0.83 when predicting one week in advance. This work could
help to identify at-risk students in a SPOC.


Keywords 


SPOC, student performance prediction, study behavior analysis, 
educational data mining, at-risk students 


1. INTRODUCTION 


EdX has designed and built an open-source online learning 
platform (OpenEdX) for online education. In addition to offering 
online courses, participating universities are also committed to 
researching how students learn and how technology can transform 
learning both on-campus and online throughout the world. 


Some research focuses on predicting students’ performance
from study-related data. Stapel, M. [1] presented an ensemble
method to predict students’ performance, which combines six
classification algorithms. Elbadrawy, A. [2] developed
multi-regression models for prediction,
and Ren, Z. [3] designed different kinds of features based on
the characteristics of MOOC courses, which improved the performance of
their predictor. In addition to study-related data, social behavior
data is helpful for prediction [4].


In this paper, we describe the performance prediction problem
and present the models we built. We also summarize which features
contributed to accurate predictions. The most
fundamental contribution is the design, development and 
demonstration of a performance prediction methodology, from 
raw logging data to model analysis, including data preprocessing, 
feature engineering, model evaluation and outcome analysis. 


2. PREDICTION PROBLEM DEFINITION 


Our SPOC was composed of 3 tutorials and 9 projects in Fall
2016. Learners studied the tutorials from week 1 to week 6, and we
released project 0 at week 7. We found it was important for
learners to move on only after they had mastered the core concepts.
Students started one project and, once they had mastered the
corresponding content, needed to pass an in-class test before they
could advance to the next project.


Here, performance prediction means predicting whether a learner
will pass the test at the end of each week based on their
study behavior. We define time slices as weekly units. Time slices
start in the first week in which an in-class test was offered (week 7)
and end in the 16th week, after the final test had closed.


We can therefore use the logging data from week 1 to week 6 to
predict the learners’ performance at week 7. Furthermore, we use
'lead' to denote how many weeks in advance we predict
performance. We assign the performance label (x1, 0 for failing
the test or 1 for passing the test) of the lead week as the label
of the predictive problem. 'Lag' denotes how many weeks of historical
variables are used for classification.


3. PREDICTING WEEK PERFORMANCE 


We did not use non-behavioral attributes such as a learner’s age
or gender. Instead, we used features that
reflect different learning habits. One type of behavioral
variable is based on the learner’s interaction with the educational
resources, including time spent on resources and on problems /
homework. As Colin Taylor described in [5], taking the effort to
extract complex predictive features that require relative
comparison or temporal trends, rather than using the direct
covariates of behavior, is one important contributor to successful
prediction. For instance, we compute the average number of
submissions per problem for each learner (x9). Then we compare
a learner’s x9 value to the distribution for that week. Feature x16
is the percentile over the distribution and x17 is the percentage
relative to the maximum of the distribution. We also extracted
features related to learners’ study habits, for instance a feature
describing whether learners begin the problems / homework
soon after release, and features characterizing whether learners
submit problems / homework in a timely fashion or at the last
minute.
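The relative-comparison features can be illustrated with a minimal sketch. It assumes that the percentile is the share of learners whose value is at or below the learner's own value, which is one common convention and not necessarily the one used in our pipeline; the function name and sample values are hypothetical.

```python
from bisect import bisect_right


def percentile_and_percent_of_max(values):
    """Return (percentile, percent_of_max) for each value, relative to the
    distribution of all values observed in the same week."""
    if not values:
        return []
    ordered = sorted(values)
    n = len(ordered)
    top = ordered[-1]
    features = []
    for v in values:
        # Percentile (x16-style): share of learners with a value <= v.
        # This "at or below" convention is an assumption of the sketch.
        percentile = 100.0 * bisect_right(ordered, v) / n
        # Percent of max (x17-style): value relative to the weekly maximum.
        percent_of_max = 100.0 * v / top if top else 0.0
        features.append((percentile, percent_of_max))
    return features


# Hypothetical x9 values (average submissions per problem) for four learners.
x9_week = [1.0, 2.0, 4.0, 2.0]
print(percentile_and_percent_of_max(x9_week))
```

Both derived features are recomputed per week, so a learner is always compared against classmates in the same time slice.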


To build predictive models, we use the common approach of
flattening the data: assembling the features from different weeks
as separate variables.
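A minimal sketch of flattening under the lead / lag definitions of Section 2 follows. The nested-dictionary layout and the function name are illustrative assumptions, not the actual data format of our pipeline.

```python
def flatten_examples(weekly_features, weekly_labels, predict_week, lag, lead=1):
    """Build one flat feature row per learner.

    weekly_features[learner][week] is a list of feature values for that week;
    weekly_labels[learner][week] is 0 (failed) or 1 (passed).
    The row concatenates the `lag` weeks that end `lead` weeks before
    `predict_week`.
    """
    last_week = predict_week - lead
    first_week = last_week - lag + 1
    rows, labels = [], []
    for learner, by_week in weekly_features.items():
        row = []
        for week in range(first_week, last_week + 1):
            # Each week contributes its own block of separate variables.
            row.extend(by_week[week])
        rows.append(row)
        labels.append(weekly_labels[learner][predict_week])
    return rows, labels


# Hypothetical data: two learners, two features per week.
features = {"s1": {5: [1, 2], 6: [3, 4]}, "s2": {5: [5, 6], 6: [7, 8]}}
labels = {"s1": {7: 1}, "s2": {7: 0}}
print(flatten_examples(features, labels, predict_week=7, lag=2, lead=1))
```

With lead = 1 and lag = 6, the same function would use weeks 1 to 6 to predict week 7, matching the example given in Section 2.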


We first used logistic regression as our binary predictive model. It
computes a weighted sum of a set of variables as the input to
the logistic (inverse logit) function, with a separate coefficient
for each feature value. For the binary classification problem, the
output of the logistic function is the estimated probability that an
example is positive.
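As an illustration, the prediction step of such a model can be sketched as below. The weights, bias, and feature values are hypothetical; fitting the coefficients is not shown.

```python
import math


def predict_proba(weights, bias, features):
    """Estimated probability of the positive class for one feature vector."""
    # Weighted sum of the features plus an intercept term.
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Logistic function, written in a numerically stable form.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical coefficients for a two-feature model.
print(predict_proba([0.8, -0.5], 0.1, [1.2, 0.4]))
```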


When applying logistic regression to weekly
performance prediction, we used 28 features to form the feature
vectors and kept the weekly performance value as the label.


3.1 Predicting Performance

To evaluate the classifier’s performance, we use a test set
of covariates and labels not seen during training and
proceed as follows:


The learned logistic function is applied to each data point to
produce the estimated probability of a positive label. A
decision rule is then applied to determine the class label for each
probability estimate. Given the estimated labels for each data
point and the true labels, we calculate the confusion matrix, including
the true positives and false positives, and obtain an operating point on
the ROC curve. We then evaluate the area under the curve and report
it as the performance of the model on the test data.
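The evaluation steps can be sketched as follows: sweeping the decision threshold over the predicted probabilities yields one operating point per threshold, and the trapezoidal rule gives the area under the resulting curve. This is an illustrative stand-alone version with hypothetical function names, not the library routine used in our experiments.

```python
def roc_points(scores, labels):
    """Operating points (FPR, TPR) obtained by lowering the threshold
    through every distinct score. Labels must be 0 or 1."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume all examples tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical predicted probabilities and true labels.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(area_under_curve(roc_points(scores, labels)))
```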


We need to present the results of multiple prediction problems
for different weeks simultaneously. That is, for each week
of the course, we want to predict the students’ weekly
performance using different amounts of historical data. The results
are assembled into the heat map of a lower right triangular matrix
shown in Figure 1.


Figure 1. Logistic regression results


The x-axis of Figure 1 is the week for which predictions are made
in the experiment, while the y-axis is the number of
weeks of data used for the prediction (lag). The color shows the
area under the ROC curve achieved by the current model.


We employed cross validation in all of our predictive modelling. 
Some partitions are used to construct a model, and others are used 
to evaluate the performance. Since our data set contains only 377
samples, we employed 3-fold cross validation and used the average
of the ROC AUC over the folds as the evaluation metric.
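A minimal sketch of the k-fold procedure is given below. The `fit_and_score` callback is a hypothetical placeholder that trains on the given training indices and returns the ROC AUC on the held-out indices; shuffling and stratification are omitted for brevity and would normally be applied first.

```python
def k_fold_indices(n_samples, k=3):
    """Split indices 0..n_samples-1 into k contiguous, near-equal folds."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_score(n_samples, fit_and_score, k=3):
    """Average of fit_and_score(train_indices, test_indices) over k folds."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for held_out in folds:
        # Every fold except the held-out one is used for training.
        train = [i for fold in folds if fold is not held_out for i in fold]
        scores.append(fit_and_score(train, held_out))
    return sum(scores) / k


# With 377 samples and k = 3 the folds hold 126, 126, and 125 samples.
print([len(fold) for fold in k_fold_indices(377, 3)])
```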


3.2 Feature Importance 

We used the randomized logistic regression methodology to
identify the relative weighting of each feature. As shown in
Figure 2, the features with the most predictive power include
how long learners interact with the resources
(max_observed_event_duration), learners’ interaction with the
problems (average_number_of_submissions_percentile), and study
habits (time_first_attempt, problem_finish_time_pre_start24h,
problem_finish_time_pre_start48h).


Figure 2. Relative importance of different features
across all variants (lag / lead)


4. SUMMARY 


We have taken an initial step towards identifying at-risk students 
in a SPOC, which could help instructors design interventions. 
Several prediction models are compared, with SVM preferred due 
to its good performance. The noteworthy contribution of our
study compared to other studies is that we extracted
variables from the clickstream logging data and then generated
complex features that describe the learners’ study behavior,
especially their study habits. Applying
the SVM model to those variables, we achieved an AUC in
the range of 0.62-0.83 when predicting one week ahead.


In the future, we will collaborate with course instructors to deploy
our predictive models. We will also pay more attention to why a
student is failing and which strategies make others succeed in a
SPOC or on-campus course.


ABSTRACT


In a climate where higher education institutions are ac- 
tively aiming to increase inclusivity [2], we explore how a 
deep learning-based tool focused on text analysis is able to 
help assess how students think about issues of privilege, op- 
pression, diversity and social justice (PODS). We created 
a vocabulary boosting and matching tool augmented with 
domain-specific corpora and relevance information. We find 
that the adoption of domain-specific corpora enhances model 
performance when identifying PODS-related words in short 
student-written responses to writing prompts, by building a 
more highly focused PODS vocabulary. 


1. INTRODUCTION AND RELATED WORK


Universities are expanding their efforts toward creating more 
inclusive institutions of higher education [2]. One specific 
example is the principled blending of curricula with social 
justice and diversity issues in order to encourage PODS 
thinking (Privilege, Oppression, Diversity, Social justice) in 
the School of Social Work at the University of Michigan. 
PODS principles have been emphasized not only in individ- 
ual courses but throughout the whole Social Work curricu- 
lum. Such a move naturally raises the question of scaled 
evaluation, both of individual students (e.g. formative or 
summative assessment) and programmatic evaluation. 


In previous work, we explored mechanisms to detect ele- 
ments of PODS thinking in student writing through semi- 
supervised machine learning [1]. We adopted the Empath 
tool [3] to generate an expanded vocabulary from a few seed 
words for PODS thinking detection, but were extremely lim- 
ited in our ability to achieve accurate results. The first issue 
stems from the selection of large but general corpora which, 
while large in size and topic coverage, were not effective 
when we attempted to learn domain-specific bigrams. The 
other issue is how to filter less relevant words while boosting 
the size of the relevant lexicon. While generating a lexicon 
for Social Justice on Empath, we found that semantically 
irrelevant words like “therefore” and “yet” were in the out- 
put lexicon [1]. Thus, we expand on previous results and 
demonstrate a more robust and thorough treatment of the 
issues of detecting PODS thinking in student writing. 


In this work, we consider the specific case of short student
writings given in response to a writing prompt. Our goal is
to build a technology solution that gives accurately coded 
responses and that enables instructors to identify quickly 
which students need elaborated feedback. The system will 
allow the instructors to focus remediation efforts on those 
who are of the highest need and to assess how well the over- 
all curricula could increase PODS competency of students. 
Here we demonstrate the feasibility of using deep learning 
methods to detect evidence of PODS and apply these meth- 
ods to a particular writing activity, innovating on the process 
used by others [3] to improve accuracy and reliability. 


2. INSTRUMENTS 


We created Metapath, a text analysis tool that allows users 
to use not only general corpora but also domain-specific cor- 
pora. Metapath is built on the ability of the Word2Vec 
model to calculate the similarity of concepts by mapping 
words and phrases to a vector space via a skip-gram model, 
and computing the cosine similarity of the corresponding 
vectors [4]. Given a word, the model gives users a ‘most simi- 
lar’ word list ordered by the similarity score. In a preprocess- 
ing step, short words (length < 2), non-English terms, and 
most stopwords are considered as noise and removed from 
the corpora. After data cleaning, all words are stemmed 
using Porter stemming. Common phrases, i.e., multiword 
expressions, can be detected automatically by calculating 
mutual information gain within a threshold and minimum 
count. For example, the words ‘Los Angeles’ will become the 
phrase los_angeles after phrase detection while the model 
will return a list of high similarity words like san_francisco 
and santa_barbara. The judgment of whether the words are 
common phrases is based on the formula 


\[
\frac{\mathrm{cnt}(a, b) - \mathrm{min\_count}}{\mathrm{cnt}(a) \cdot \mathrm{cnt}(b)} \cdot N > \mathrm{threshold}
\]


where cnt(a, b) means the frequency of word a and word b
located together and N is the total vocabulary size.
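The phrase criterion can be sketched in a few lines. The counts, vocabulary size, and threshold below are hypothetical values chosen only to show one accepted and one rejected word pair.

```python
def phrase_score(cnt_ab, cnt_a, cnt_b, vocab_size, min_count):
    """Score of a candidate bigram: (cnt(a, b) - min_count) * N / (cnt(a) * cnt(b))."""
    return (cnt_ab - min_count) * vocab_size / (cnt_a * cnt_b)


def is_common_phrase(cnt_ab, cnt_a, cnt_b, vocab_size, min_count, threshold):
    """A bigram is kept as a phrase when its score exceeds the threshold."""
    return phrase_score(cnt_ab, cnt_a, cnt_b, vocab_size, min_count) > threshold


# Hypothetical counts: two words that almost always occur together ...
print(is_common_phrase(50, 60, 55, vocab_size=10000, min_count=5, threshold=10))
# ... versus two frequent words that rarely co-occur.
print(is_common_phrase(6, 5000, 4000, vocab_size=10000, min_count=5, threshold=10))
```

Subtracting min_count discounts rare co-occurrences, so a pair seen together only a handful of times cannot reach the threshold by chance.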


We chose to use domain-specific corpora, i.e., MICUSP 
(Michigan Corpus of Upper-level Student Papers) and 
BAWE (British Academic Written English) [5], for detecting 
common phrases. The general Wikipedia corpus is used to
train the model. In addition, considering the contextual nature
of the PODS words, existing student responses gathered
from courses were included as a corpus. The domain-specific
corpora are able to detect more related phrases on the top- 
ics of interest. For example, the proportions (10~°%) of 
stemmed words like ‘prejudic’ and ‘social_justic’ in domain- 
specific corpora were relatively high (respectively 0.079 and 
0.015), compared to the proportions of the same words in 
the general corpora, which were much lower (0.012 and 0). 


3. EVALUATION 


We conducted an evaluation to assess how well Metapath 
can assess PODS-related writing, using our domain-specific 
corpora, along two dimensions: comparing (1) inter-rater re- 
liability (IRR) for PODS word annotation between human 
raters and Metapath and (2) IRR for quality evaluation be- 
tween human raters and Metapath. The latter method is 
to include percentage of relevance of PODS words, which 
shows how semantically related each word is to seed words. 


3.1 Data 


The students’ short written responses on PODS topics, collected
from four sections of a course offered in the School of Social Work,
were used to evaluate Metapath (n = 100; word
counts: x̄ = 695.52, σ = 434.08, min = 115, max = 2747).


3.2 Approaches 

For the evaluation, two expert human coders annotated 
PODS-related words in the student responses and evaluated 
overall PODS-relevance of each writing piece with three different
marks: high, medium, and low. Their annotations
and quality evaluations of the student responses were compared
with the results of Metapath. To build a lexicon to evaluate
PODS relevance of student writing, Metapath was boosted 
by essential PODS words, i.e., privilege, oppression, diver- 
sity, and social justice. Furthermore, two keywords from 
the writing prompt, i.e., “issues” and “actions”, were also 
used to boost the PODS lexicon. After we boosted a lexicon 
(dim=500), the lexicon was used to calculate the IRR on an- 
notations among two human raters and Metapath. The lex- 
icon and its percentage of relevance were used to assess the 
overall PODS relevance of each response. After all the re- 
sponses were ranked based on their percentage of relevance, 
they were categorized into high, medium, and low. The 
threshold for each category was based on the proportion
of responses assigned to that category by the human raters.
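The ranking and categorization step can be sketched as follows. The response identifiers, relevance values, and category shares are hypothetical; in our evaluation the shares would come from the human raters' marks.

```python
def categorize_by_proportion(relevance, proportions):
    """Assign ranked responses to categories of fixed relative size.

    relevance: dict mapping response id -> percentage of relevance.
    proportions: list of (label, share) pairs ordered from the most to the
    least relevant category, with shares summing to 1.
    """
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    n = len(ranked)
    result, start, cumulative = {}, 0, 0.0
    for idx, (label, share) in enumerate(proportions):
        cumulative += share
        # The last category absorbs any rounding remainder.
        end = n if idx == len(proportions) - 1 else round(cumulative * n)
        for response_id in ranked[start:end]:
            result[response_id] = label
        start = end
    return result


# Hypothetical relevance scores and rater-derived category shares.
relevance = {"r1": 0.9, "r2": 0.1, "r3": 0.5, "r4": 0.7, "r5": 0.3}
shares = [("high", 0.2), ("medium", 0.4), ("low", 0.4)]
print(categorize_by_proportion(relevance, shares))
```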


4. RESULTS AND DISCUSSION 


We calculated group agreement among the two human raters
and Metapath using Krippendorff’s alpha (α). For the annotation
comparison, the IRR between the two human raters alone is
α = 0.4480 (n = 100). When we added Metapath, the overall
group agreement dropped to α = 0.3804 (responses = 100,
boosted words = 4300; the minimum and maximum possible
agreement in the 3-rater scenario: -0.4056 < α < 0.6324).
IRRs between each human rater individually and Metapath
were α = 0.1622 and α = 0.1822. For the quality
evaluation, we achieved α = 0.3441 (responses = 100,
boosted words = 660) as the level of agreement between
human raters and Metapath, which is close to the IRR between
the two human raters (α = 0.4393; the minimum
and maximum possible agreement in the 3-rater scenario:
-0.1875 < α < 0.6223). IRRs between each human rater
individually and Metapath were α = 0.3702 and α = 0.2234.
Overall, the evaluation showed that Metapath could identify
PODS-related words and overall PODS relevance. The
IRR that Metapath reached was close to those of human
raters and not too low, considering the possible minimum
and maximum agreement range.
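For readers unfamiliar with the statistic, the following sketch computes Krippendorff's alpha from a coincidence matrix. It assumes the nominal distance metric, which is an assumption of this illustration (an ordinal metric may be more appropriate for the high / medium / low marks), and the ratings shown are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of lists; each inner list holds the labels that the raters
    assigned to one item. Items with fewer than two labels are skipped.
    """
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a single rating cannot be paired
        # Every ordered pair of ratings within an item counts 1 / (m - 1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Hypothetical binary annotations from two raters on four items.
print(krippendorff_alpha_nominal([[1, 1], [1, 1], [0, 0], [1, 0]]))
```

Alpha equals 1 for perfect agreement and 0 when agreement is no better than chance, which is why the attainable minimum and maximum are reported alongside each value.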


It is worth pointing out that higher agreements in PODS 
word detection do not align with higher agreements in overall
PODS relevance. We varied the size of Metapath’s vocabulary
in steps of 500 words by setting the number-of-boosted-words
parameter. Even quite large vocabularies boosted the
effectiveness of Metapath in the first task, declining only
when values reached n ≈ 4000. However, the IRR for quality
analysis was the highest when n = 660.


Further research is needed to explore and improve the perfor- 
mance of Metapath. While identifying PODS-related words, 
there are still words and phrases in the field of social work 
that are not detected by Metapath, as noted by the experts. 
One way to address this is to focus on improved corpora, 
such as increasing the amount of response data generated 
by social work students and articles or books curated by 
PODS experts, or by using corpora based on accumulated 
Social Work students’ writing. Finally, we note that this
task is highly multifaceted, and here we have taken just
a first pass at addressing it. Issues of personally-lived experiences,
intersectionality of topics, and the nature of the
writing prompt itself may require more traditional natural
language processing techniques in order to capture deeper
relationships in the text more fully.


ABSTRACT


It is widely understood that students learn in a variety of 
different ways and what is beneficial for one student may 
not necessarily help another. This work observes the effec- 
tiveness of Causal Forests as they compare to a new method 
we present called Naive Causal Forests. This new method, 
aimed to be a simpler, more intuitive approach to identifying 
heterogeneous effects, is developed to better understand the 
strengths and limitations of the Causal Forest method. We 
apply these techniques to real student data on three RCTs 
run within the ASSISTments online learning platform. 


Keywords 
Personalization, Heterogeneous Treatment Effects, Random- 
ized Controlled Trials, Causal Forest, Random Forest 


1. INTRODUCTION 


The idea that students approach learning in differing ways
is not a new concept to researchers in the field of education,
but how to leverage computer-based systems for
individualized learning is not always clear. Individualization,
also referred to as personalization, exists outside
the field of education as well. In other fields, this idea is
described through heterogeneous treatment effects, as the
effect of a particular treatment or intervention is often not
homogeneous across all individuals. The introduction of
computer-based systems in the classroom makes it feasible
to supply aid to individuals, allowing the teacher to focus on
helping those students who are struggling most.


Recently, a technique known as a Causal Forest (CF) [8] 
has been developed, applying random forests to the task 
of identifying heterogeneous effects. This work explores a
new, more intuitive method for identifying heterogeneity as
it compares to the more complex CF method. This new 
method, called Naive Causal Forest (NCF), attempts to em- 
ploy a simpler approach based on the structure of CF to 
answer: 1. To what extent, if any, does the Causal Forest 
method outperform our simpler, more intuitive approach to 
identifying heterogeneous treatment effects in real student 
data? and 2. Do these models converge to large differences 
when compared using increasing sample sizes? 


2. DATASET 


The dataset used to build and evaluate our method is com- 
prised of student information on 3 randomized control trials 
(RCTs) run within the ASSISTments online learning plat- 
form [2] from a previously published dataset [5]. ASSIST- 
ments is a free web-based platform where a recent efficacy 
trial found the system to be effective in improving student 
learning [4], motivating further study to better understand 
student behavior and measure effects within the platform. 


After filtering the data to remove students with missing values,
Experiment 1 contains 519 students, Experiment 2
contains 833 students, and Experiment 3 contains
1118 students.


3. METHODOLOGY 

The Causal Forest (CF) method [8] has established itself as a
viable model for identifying heterogeneous effects, which
we do not dispute; rather, we wish to explore the benefits
of this more complex method relative to a simpler, more intuitive
approach. CF uses estimates of treatment effects within
the splitting rule of a random forest algorithm; an “honest” 
variant uses a holdout set to estimate the effect for each 
split. Heterogeneous effects can be determined by observing 
students who then are grouped into different leaves of the 
generated trees. Our new method, which we have called 
Naive Causal Forest, aims to implement a simpler approach 
that excludes the use of condition from the random forest 
until students are grouped into each leaf, where then an 
average treatment effect is calculated across each subgroup. 
In both methods, each tree has a “vote” as to what condition 
will benefit the students most.


Figure 1: The 10-fold cross validation results for experiments 1 and 2 comparing NCF to an honest CF model.
No reliable differences are found between the two methods, and both appear consistent with increases to the
number of generated trees.


Figure 2: Experiment 3 bootstrapping results comparing
NCF to two Causal Forest models.


We compare CF, implemented in R [3] using a Causal Tree 
package [1], and NCF in their ability to identify heteroge- 
neous effects for the purpose of maximizing completion of 
the assignment. We calculate the Odds Ratio [7] within 
each leaf to identify which condition corresponds with the 
higher student completion rate within each leaf. We eval- 
uate our models using a measure known as policy risk [6], 
where a lower value indicates better performance. This met- 
ric is used to compare the two methods for each experiment 
as the metric is not directly comparable across experiments. 
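The leaf-level comparison and the per-tree vote can be sketched as below. This is an illustrative simplification rather than our R implementation: the leaf assignments are assumed to come from trees grown without the condition variable, trees abstain when the ratio is undefined or exactly 1, and ties go to the control condition. All names and data are hypothetical.

```python
import math
from collections import defaultdict


def odds_ratio(treated_done, treated_total, control_done, control_total):
    """Odds of completion under treatment divided by the odds under control."""
    a, b = treated_done, treated_total - treated_done
    c, d = control_done, control_total - control_done
    if b * c == 0:
        # Infinite or undefined ratio; a continuity correction is an alternative.
        return math.inf if a * d > 0 else math.nan
    return (a * d) / (b * c)


def leaf_odds_ratios(leaf_ids, treated, completed):
    """Odds ratio of completion (treatment vs. control) within each leaf."""
    # Per leaf: [treated done, treated total, control done, control total].
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for leaf, is_treated, done in zip(leaf_ids, treated, completed):
        offset = 0 if is_treated else 2
        counts[leaf][offset] += int(done)
        counts[leaf][offset + 1] += 1
    return {leaf: odds_ratio(*cell) for leaf, cell in counts.items()}


def forest_vote(student_leaves, per_tree_ratios):
    """Majority vote over trees for one student.

    student_leaves[i] is the student's leaf in tree i and per_tree_ratios[i]
    maps the leaves of tree i to their odds ratios.
    """
    votes = 0
    for leaf, ratios in zip(student_leaves, per_tree_ratios):
        ratio = ratios.get(leaf, math.nan)
        if math.isnan(ratio) or ratio == 1:
            continue  # this tree abstains
        votes += 1 if ratio > 1 else -1
    return "treatment" if votes > 0 else "control"


# Hypothetical leaf with 4 treated students (3 completed) and 4 controls (1 completed).
ratios = leaf_odds_ratios(["a"] * 8, [1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0, 0])
print(ratios, forest_vote(["a", "a"], [ratios, {"a": 2.0}]))
```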


4. DISCUSSION AND FUTURE WORK 


The result of our 10-fold cross validation analysis can be seen 
in Figure 1. Both models use a minimum leaf size of 30, and 
are evaluated over several model complexities. In all three 
experiments, it is found that the CF and NCF model exhibit 
no reliable differences. It is also the case, however, that no 
significant heterogeneous effects are found by either method. 
Figure 2 illustrates how the methods converge with increas- 
ing sample sizes using a bootstrapping method of sampling 
with replacement on the largest experiment. 


We compare in this work the Causal Forest method for iden- 
tifying heterogeneous treatment effects to our Naive Causal 
Forest method and find no reliable differences between the 
simpler and more complex methods. It is expected, and
planned for future work, that applying these methods to
experiments with larger sample sizes may show statistically
reliable differences.


We also found that the CF model exhibited stable policy
risk over increases to model complexity. This is a desirable
quality of a prediction model, as it is data driven and less
sensitive to changes in model structure. We found that the
CF model exhibited non-converging behavior when bootstrapping,
but this may also be caused by insufficient
variation or a lack of heterogeneity in the dataset.


ABSTRACT


Heterogeneous treatment effects occur when the treatment affects
different subgroups of a population differently. In this work, we
conducted a large-scale simulation study to identify the
characteristics of treatments that are more likely to have
heterogeneous treatment effects, and to estimate how effective
individual treatment rules are compared to assigning everyone to
the better condition.
We found that heterogeneous treatment effects are rare. When the
overall treatment effect is close to zero, we found that an individual
treatment rule is very likely to be effective. With a large positive or
negative overall treatment effect, a heterogeneous treatment
effect is less likely to occur, and individual treatment rules are
more likely to be ineffective.


Keywords 


Heterogeneous Treatment Effect; Individual Treatment Rule; 
ASSISTments; Randomized Controlled Experiment. 


1. INTRODUCTION

Researchers have been using randomized controlled experiments
(RCTs) to test their interventions. RCTs are considered the gold
standard and are widely used in many fields, from healthcare to 
education. Traditionally, researchers often look for treatment 
effects across the population. However, in many experiments, the 
treatment effect differs systematically from one subgroup of the 
population to another. For example, patients who are allergic to the 
treatment drugs may react negatively instead of benefiting from the 
drug. This type of effect is often called heterogeneous treatment 
effects, as there are different effects for different types of people. 
Many machine learning methods have been developed to detect 
heterogeneous treatment effects. For example, [4] introduced the 
Causal Forest, a decision tree-based method to determine the 
treatment effect on each subgroup of the population. 


In many cases such as [1], it is better to tutor students with lower 
prior knowledge using step-by-step hints, while it is better to tutor 
students with high prior knowledge with full problem solutions. In 
this case, giving personalized tutoring to each student is better than 
giving the same tutoring to everyone. This type of condition 
assignment is often called an individual treatment rule or a 
personalization policy. 


In order to evaluate a personalization policy, the most popular 
method is to deploy the policy in real time and compare the result. 
However, the on-line method is often costly and sometimes 
unavailable to the researchers (e.g. because the data have already 
been collected). As a result, many researchers conduct an offline 
policy evaluation using past data. The authors of [3] use the expected
outcome of the policy to evaluate their personalization policy. To 
calculate the expected outcome using past RCT data, we must first 
find a subset of subjects whose random condition assignments 
during the RCT matches the personalized condition assignments of 
the policy. The expected outcome of a personalization policy is the 
average outcome of this subset across conditions. Comparing two
policies using the expected outcome is easy and intuitive; if larger
outcome values are better, the policy with the larger expected outcome
is better. This method is equivalent to policy risk introduced in [2]. 


The main goals of this work are 1) to find the characteristics of the 
experiments that are more likely to have heterogeneous treatment 
effects, and 2) to compare a personalization method, specifically 
Causal Forest, against assigning every subject to the best conditions 
to find out how effective a personalization policy can be. 


2. METHODOLOGY 


In order to gain a better understanding of expected outcome, we 
investigated how it is calculated in [3]. They first took the subset of 
the subjects from the RCT whose random condition assignments 
are the same as the condition assignments given by a 
personalization policy. For the rest of this paper, we will refer to 
this subset as the “congruent subset”. Then, the expected outcome 
of the policy is calculated by taking the average outcome values of 
the congruent subset regardless of conditions. For example, in 
Table 1, the congruent subset consists of subjects 1, 3, 4, and 5, and
the expected outcome of the policy is (0.7 + 0.4 + 0.6 + 0.7)/4 = 0.6.
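The calculation can be sketched as follows. The rows are hypothetical and were chosen only so that subjects 1, 3, 4, and 5 form the congruent subset with the outcome values used in the example; the condition labels "A" and "B" and the outcomes of subjects 2 and 6 are assumptions.

```python
def congruent_subset(records):
    """Subjects whose random RCT condition equals the policy's assignment."""
    return [r for r in records if r["rct"] == r["policy"]]


def expected_outcome(records):
    """Average outcome of the congruent subset, regardless of condition."""
    subset = congruent_subset(records)
    if not subset:
        raise ValueError("no subject matches the policy's assignment")
    return sum(r["outcome"] for r in subset) / len(subset)


# Hypothetical rows consistent with the worked example.
records = [
    {"subject": 1, "rct": "A", "policy": "A", "outcome": 0.7},
    {"subject": 2, "rct": "A", "policy": "B", "outcome": 0.5},
    {"subject": 3, "rct": "B", "policy": "B", "outcome": 0.4},
    {"subject": 4, "rct": "A", "policy": "A", "outcome": 0.6},
    {"subject": 5, "rct": "B", "policy": "B", "outcome": 0.7},
    {"subject": 6, "rct": "B", "policy": "A", "outcome": 0.3},
]
print([r["subject"] for r in congruent_subset(records)])
print(round(expected_outcome(records), 3))
```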


2.1 Simulation Study 


We conducted a large-scale simulation study to verify the 
effectiveness of using the congruent subset as an estimate of real 
outcome values of the policy, and to find types of experiments that 
are likely to benefit from personalization. We chose a simulation study
because it allows us to not only calculate the real outcome values 
of the policy, but also investigate how different settings impact the 
personalization. 


Table 1: Example data showing how the congruent subset works

[Table 1 columns: Subject; RCT condition; Personalized condition; Outcome; Is in congruent subset?]


Table 2: Different Distributions for Effect of Conditions

[Table 2 lists the effect distributions, their parameter values, and the number of combinations for each; 46 settings in total.]


For the simulation study, we focused only on experiments with two 
conditions. For each condition, we simulated 46 different settings, 
as shown in Table 2, resulting in 46 * 46 = 2116 different 
combinations of experiments. We also include lognormal 
distributions and gamma distributions because real datasets may 
not always follow normal distributions, for example the mastery 
speed in [5] resembles lognormal distribution. For each setting, we 
generated 1000 datasets, each of which has 1000 data points. 


Every data set has 3 covariates: one with a positive effect, one with a
negative effect, and one with no effect on the outcome. Every covariate
value is generated independently for each subject from a normal
distribution with mean = 0 and sd = 1. The true effect is generated using
the distribution and parameters in Table 2. The observed outcome is


\[
\text{observed} = \text{effect} + \text{cov}_1 \cdot \text{impact}_1 - \text{cov}_2 \cdot \text{impact}_2 + \text{noise}
\]


The impacts are drawn from uniform(0, 5) and remain constant within an
experiment. The noise is drawn from a normal(0, 1) distribution.
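The data-generating process can be sketched as follows. The two effect samplers in the usage example (a normal and a gamma distribution with arbitrary parameters) and the uniform random assignment to the two conditions are assumptions of this sketch, not the exact settings of Table 2.

```python
import random


def simulate_experiment(n_subjects, effect_samplers, seed=0):
    """Simulate one two-condition experiment.

    effect_samplers: one function per condition; each takes a random.Random
    instance and returns a true effect drawn from that condition's distribution.
    Returns the simulated rows and the two impacts used for the experiment.
    """
    rng = random.Random(seed)
    # Impacts are drawn once and stay constant within the experiment.
    impact1 = rng.uniform(0, 5)
    impact2 = rng.uniform(0, 5)
    rows = []
    for _ in range(n_subjects):
        condition = rng.randrange(2)  # random assignment, as in an RCT
        # Three independent standard-normal covariates; cov3 has no effect.
        cov1, cov2, cov3 = (rng.gauss(0, 1) for _ in range(3))
        effect = effect_samplers[condition](rng)
        noise = rng.gauss(0, 1)
        observed = effect + cov1 * impact1 - cov2 * impact2 + noise
        rows.append({
            "condition": condition, "cov1": cov1, "cov2": cov2, "cov3": cov3,
            "effect": effect, "noise": noise, "observed": observed,
        })
    return rows, (impact1, impact2)


# Hypothetical setting: normal effect for condition 0, gamma effect for condition 1.
samplers = [lambda rng: rng.gauss(0.0, 1.0), lambda rng: rng.gammavariate(2.0, 1.0)]
rows, impacts = simulate_experiment(1000, samplers, seed=1)
print(len(rows), impacts)
```

Because the true effect of each row is stored alongside the observed outcome, the real outcome of any policy can be computed directly and compared with the estimate from the congruent subset.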


For each personalization policy, we measured 1) if the outcome 
values of congruent sets are significantly different from the 
outcome values of actually assigning everyone using 
personalization policy, and 2) whether the personalization from the 
Causal Forest is better than the better of the two conditions. 


3. RESULTS 


Across the 2,116,000 simulated datasets, we detected a significant
difference between the outcome values of the congruent sets and
the real personalized outcome values less than 1% of the time,
which is far lower than the threshold of 5%, regardless of the
parameters of the dataset. As for the effectiveness of the Causal
Forest, we look at how often the personalization suggested by
Causal Forest is better than assigning subjects to the better of the
two conditions. We found that personalization is slightly more
common when at least one of the distributions is a gamma distribution.


Table 3: The Effectiveness of Personalization Suggested by
Causal Forest by Overall Observed Treatment Effect

[Table 3 columns: Rounded average observed treatment effect; Causal Forest suggests personalization; Causal Forest’s personalization is the most effective]


Table 3 shows that when the treatment effect is close to zero, the
personalization suggested by the Causal Forest is very effective.
The Causal Forest policy is better than assigning subjects to the
better of the two conditions more than 3/4 of the time when the
treatment effects are between -1 and 1. The effectiveness of the
personalization drops quickly as the treatment effect moves away from
zero. It is important to note that the Causal Forest we used in this
study was not tuned, and most of the parameters we used are
defaults, except the two we specified earlier in the paper.


4. CONCLUSION 


This paper makes three main contributions. First, we promoted the
study of heterogeneous effects and an offline personalization policy
evaluation method to the Educational Data Mining community. Second, we
investigated several different settings of simulated experiments to
find the characteristics of experiments that are more likely to
have heterogeneous treatment effects. We found that heterogeneous
treatment effects are generally uncommon, and are especially rare
when the treatment effects are very large or very small. Third, we
investigated the effectiveness of the personalization policies given
by the Causal Forest. We found that the personalization policy is
likely to be effective for experiments with small treatment effects.


5. FUTURE WORK 


We plan to investigate different methods for detecting
heterogeneous treatment effects on real datasets from ASSISTments
to see if we can detect more experiments like [1]. If we can detect
such effects, we would be able to improve our system even further,
which will improve student learning.


We also plan to compare different methods for detecting
heterogeneous treatment effects to identify the advantages and
disadvantages of each model, and to compare these pre-trained
models with real-time methods such as bandits. These results
will allow us to choose the right tool for the right
personalization task.


ABSTRACT


In this preliminary study, we introduce the MyCOS Intelligent Teaching
Assistant (MITA), an open learning platform tailored to a
specific challenge of Chinese universities, i.e., undergraduates report
less student-faculty interaction than those in the U.S. Compared
with existing classroom tools like Socrative, MITA leverages the
app-within-an-app model of WeChat (the largest social app in China)
instead of a stand-alone app. Which model is the future is debatable.
MITA also uses prompt feedback to engage learners and dashboards
to inform teachers and administrators. It now serves more than 3,200
teachers and nearly 110,000 students from 600+ Chinese universities.
What the data from the platform reveal about learning deserves
further study.


Keywords


Open learning platform, student engagement 


1. INTRODUCTION 


Researchers have found a gap in student-faculty interaction (SFI)
between Chinese universities and their American peers. According to a
comparative study of 2009 National Survey of Student Engagement
(NSSE) results, 27% of Tsinghua (a Chinese research university)
undergraduates had never received prompt feedback from faculty
on academic performance, while the average in American
research universities was 7% [1].


MyCOS Intelligent Teaching Assistant (MITA) is an open learning 
platform tailored to the context of Chinese universities. Unlike
existing tools such as Socrative, MITA enables teachers to
interact with students through the app-within-an-app model of 
WeChat (the most popular social app in China). Whether this model 
is better than a stand-alone app to engage college students is 
debatable. It would be interesting to explore similar learning tools 
that leverage Facebook or other social apps in different countries and 
then compare. 


Inspired by the 2011 proposal of open learning analytics [2], MITA
tracks learner behaviors and provides prompt feedback. It has data
dashboards for teachers (see Figure 1) and administrators to
monitor the learning process and take informed actions. Since its
launch in September 2016, MITA has been used by more than 3,200
teachers and nearly 110,000 students in 600+ Chinese universities. It
is a real case of collaboration across the research, industry and
education sectors. The fast development and nationwide
deployment of MITA can produce data useful for further study.


The rest of the poster is organized as follows. In section 2,
we describe the data sample; in section 3, we report the learning
behavior patterns the data reveal; in section 4, we discuss the need
for further analysis.


2. DATA SAMPLE 

The sample used in this preliminary study was selected from MITA 
clickstream data between 2016/09/10 and 2017/02/06. During the 
time period, 1,599 teachers and 45,383 students registered. Among 
them, 766 teachers and 32,305 students have verified their institute 
information and interacted through MITA at least once. They are 
defined as active teachers and active students in this study. 


To assess student engagement, we focus on the related learning 
patterns the MITA data reveal. Specifically, the patterns discussed 
below (in section 3) are student attendance, quiz participation and 
questions answered. 


The sample covers 278 Chinese universities, including 199 four-year
universities (71.6%) and 79 three-year vocational colleges.


3. BEHAVIOR PATTERNS 
3.1 Student Attendance 


Existing studies on student attendance have been limited to a single
institution; e.g., a 2015 study of 2,141 classes at a four-year
Chinese university found an average attendance rate of 89% [3]. The
student attendance pattern based on the MITA sample extends
nationwide, and the numbers fall within a reasonable range. The
average attendance rate is higher in three-year vocational colleges
(92.8%) than in four-year universities (89.2%).


Daily attendance behaviors demonstrate a similar pattern: the 
attendance rate of three-year vocational colleges is higher than that 
of four-year universities every weekday except Friday. The lowest 
daily attendance rate for three-year colleges is on Friday (88.9%) 
while for four-year universities it is on Monday (87.9%). Hourly
attendance behaviors show a common challenge for both categories 
of universities: classes scheduled in the evening (6-9 pm) have the 
lowest attendance rates (85% for three-year vocational colleges and 
83.9% for four-year universities). 


3.2 Quiz Participation 

Quiz participation is one of the indicators used by researchers to
monitor online learning behaviors [4]. MITA enables us to conduct a
similar analysis in a real classroom. When students take a quiz in
class with MITA, they can view their progress in real time and get
feedback immediately after submission. With the fine-grained data,
the teacher can check who participated, who got an answer wrong and
which part of the course content is most challenging.


Based on the MITA sample, the average quiz participation rate is
84.5% for three-year vocational colleges and 81.7% for four-year
universities. Both are higher than the quiz participation rate in
MOOCs. A 2014 study found that 40% to 70% of learners completed
no quizzes in two live MOOCs (i.e., in-session, instructor-led courses
with the possibility of obtaining a statement of achievement) [5].


3.3 Questions Answered 

Asking questions is one of the teaching strategies used in college
classrooms. In a 2013 study, a researcher observed 30 English classes
at a four-year Chinese university for two months. She also surveyed
25 teachers and 237 students to analyze the behaviors of asking and
answering questions in class [6]. Data collection becomes more
efficient with MITA. Based on the MITA sample data, about half of the
teachers in three-year vocational colleges (51.7%) use MITA to ask
questions in every class session. The proportion is lower in four-year
universities (41.6%).


The proportion of students answering questions, however, is quite
low. The MITA data show that 96.7% of students in three-year
vocational colleges and 98% in four-year universities never
answered a question in class. The result looks plausible given the
large class sizes in the sample: 36.8% of classes in three-year
vocational colleges and 47.2% of classes in four-year universities
have more than 50 students. It suggests that an alternative strategy
(e.g., an open question in a quiz) might engage more students.


4. DISCUSSION 


The focus of this preliminary study is to enhance student-faculty 
interaction in a real classroom. Besides, MITA has the data on 
learning behaviors before class (e.g. viewing the course PPT) and 
after class (e.g. submitting an assignment) for further exploration. 


Further study will use EDM and LA techniques (e.g., user behavior
modeling) to explore the MITA data in terms of student motivation,
performance and satisfaction. More clickstream data (e.g., the number
of attempts students make on a quiz) can be collected and analyzed.
Different learning patterns can be compared across not only 
institutional type (four-year universities vs. three-year vocational 
colleges) but also class size (small, medium and large) or course 
type (required courses vs. elective courses). The comparison can 
provide actionable information for teachers and administrators. 


According to the 2015 IMPACT report from Purdue University, nearly
half of the faculty (48%) chose the ICT-supplemental learning model to
redesign their courses, 46% chose the hybrid or flipped model and
only 6% chose online-only [7]. This indicates the possibility of
developing and deploying MITA or similar learning tools for real
classrooms in different countries. The use of Facebook in the
classroom has been explored in the U.S. [8], Canada [9], and
Singapore [10], but more third-party applications like MITA are
needed to extend the capability of Facebook as a learning tool, and
the debate on whether we should ban or embrace such tools is
ongoing.


Figure 1. Teacher Dashboard of MyCOS Intelligent Teaching
Assistant (MITA).


ABSTRACT


The automatic classification of learning objects (LOs) into different
categories enables us to search for, access, and reuse them
effectively and efficiently. Following this idea, in this paper we
focus specifically on how to automatically recommend the
classification attribute of the IEEE LOM when a user adds a new LO to
a repository. To do so, we propose a multi-label classification
approach, since each LO may be simultaneously associated with
multiple labels. An initial problem we found is that the number of
terms or pure text features that characterize LOs tends to be very
high. So, we propose applying a dimensionality reduction process. We
carried out an experiment using 515 LOs from the AGORA repository in
order to reduce the number of features or attributes used,
improving execution time without losing prediction accuracy.


Keywords 


Multi-label classification, feature selection, learning object 


1. INTRODUCTION 


The IEEE Learning Object Metadata standard (IEEE LOM)
defines several attributes that may be assigned to each Learning
Object (LO). However, manually entering all these metadata is a
time-consuming process, and automated techniques are required
for wider adoption of the standard [2]. In this paper, we focus
on how to automatically recommend the classification attribute of
the IEEE LOM when a user adds a new LO to a repository. Our
idea is to recommend to the user the possible categories that an
LO belongs to, based only on user-provided information about
the LO (such as the title, keywords and description). To do so,
we propose to use multi-label classification to automatically
categorize LOs from the terms or pure text features that
characterize them. Multi-label classification (MLC) is a
variant of the classification problem in which multiple target labels
can be assigned simultaneously to each instance [1]. In traditional
classification, classes are mutually exclusive, that is, a specific
instance can belong to just a single class. However, there are
occasions where classes overlap, that is, a specific
instance can belong to several classes. In our case, we use MLC
because a specific LO could belong to several categories.


2. PROPOSED METHODOLOGY 


Our proposed approach for automatically classifying LOs is
shown in Figure 1. First, we create the data file from
the terms or pure text features that characterize the LOs, extracted
from the LOs' metadata, and the categories to which each LO
belongs. Next, we perform attribute selection. The final step is the
application of an MLC algorithm, which gives us a model for
classifying new LOs.


[Figure 1 diagram; recoverable labels: Creating data file,
Preprocessing, Generating Input Attributes, Adding Class, LOs
metadata, Data Mining, Multi-label Classification algorithm, Model,
Recommendation, Categories, OFF-LINE, ON-LINE.]


Figure 1. LO multi-label classification approach. 


3. EXPERIMENTAL WORK 


The data file used in this work was extracted from 515 LOs
in the AGORA repository [3] as follows. When a user adds a
new LO to AGORA, they must provide information such as title,
keywords, description and other related IEEE LOM metadata.
Starting from this information about all the LOs, we extracted
1336 terms (features) after removing stop words and stemming (to
reduce the terms to their roots). Next, we computed the frequency
of these roots in each LO to obtain its term frequency
(TF) representation. We thus obtained an example-term matrix, in
which each element represents how many times a term appears in
an example. We also normalized the counts to term frequencies to
measure the importance of a term.

In addition, in AGORA a user has to specify one or several categories
to which the LO belongs from a predefined set of five academic
disciplines: Engineering and Technology; Natural and Exact Science;
Social and Administrative Science; Education, Humanities and Art;
Health Science. So, we added the 5 labels (in binary format) to each
LO as classes to predict.

Then, we applied a dimensionality reduction process to reduce the
number of attributes in the dataset. The motivation is to reduce
training and classification times and to remove noisy and irrelevant
attributes, which can have a negative impact on accuracy. Usually,
there is a wide range of possible terms that can refer to LOs on very
different topics, and hence the number of attributes describing LOs
tends to be very high. Feature selection was performed according
to a specific method for MLC suggested in [5]. First, the χ²
feature ranking method was applied separately to each label. Thus,
for each label, the worth of each attribute is estimated by
computing the χ² statistic with respect to the label to determine
its independence. The core idea is that, if an attribute is
independent of a class, this attribute can be removed. The result of
this step is a ranking of all features for each label according to
the statistic.
Then, the top-n features were selected based on their maximum
rank over all labels. Finally, 13 different state-of-the-art MLC
algorithms [1] were applied to the different versions of the
dataset. They include 3 adaptation algorithms: AdaBoost.MH,
Multi-Label k-Nearest Neighbor (MLkNN) and Instance-based
Logistic Regression (IBLR), and 10 transformation algorithms in
which the J48 implementation of the C4.5 decision tree algorithm
was used as the base classifier: Binary Relevance (BR), Classifier
Chains (CC), Calibrated Label Ranking (CLR), Label Powerset,
Pruned Sets (PS), Ensemble of Pruned Sets (EPS), Ensemble of
Classifier Chains (ECC), Random-k-LabelSets (RAKEL),
Hierarchy Of Multi-label classifiERs (HOMER) and Stacking.
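
The per-label χ² ranking and the top-n selection described above can
be sketched as follows. This is an illustration under stated
assumptions, not the MULAN implementation: terms are binarized to
presence or absence before computing a 2x2 χ² statistic, "maximum
rank" is read as a feature's best position over all labels, and the
function names are hypothetical.

```python
def chi2_2x2(feature_present, label):
    """Pearson chi-square statistic for two binary sequences."""
    n = len(label)
    a = sum(1 for f, y in zip(feature_present, label) if f and y)
    b = sum(1 for f, y in zip(feature_present, label) if f and not y)
    c = sum(1 for f, y in zip(feature_present, label) if not f and y)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # constant feature or label: no evidence of dependence
    return n * (a * d - b * c) ** 2 / denom


def select_top_n(term_counts, labels, n):
    """Return the indices of the n features with the best rank over all labels."""
    n_features, n_labels = len(term_counts[0]), len(labels[0])
    best_rank = [n_features] * n_features
    for l in range(n_labels):
        label = [row[l] for row in labels]
        scores = [chi2_2x2([row[j] > 0 for row in term_counts], label)
                  for j in range(n_features)]
        # Rank 1 goes to the feature with the largest statistic for this label.
        order = sorted(range(n_features), key=lambda j: -scores[j])
        for rank, j in enumerate(order, start=1):
            best_rank[j] = min(best_rank[j], rank)
    chosen = sorted(range(n_features), key=lambda j: best_rank[j])[:n]
    return sorted(chosen)
```

Taking the best rank over labels keeps a term that is highly
informative for only one discipline, which a single averaged score
could discard.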
The MULAN software for MLC [4] was used to run
both the feature selection method and the MLC algorithms. We
used 10-fold cross-validation with 10 seeds. Our
experimentation takes two main factors into consideration: the
number of attributes and MLC performance. Overall, the time
an MLC algorithm takes to generate a model is
proportional to the number of training instances and the number
of attributes describing each instance. So, if we reduce the number
of attributes, the computational cost is reduced as well.
However, as reducing the number of attributes could discard
relevant information, the induced model could perform poorly.
This is why we performed attribute selection at
different reduction levels, in order to determine the most suitable
reduction level that does not damage classification performance.
Our original dataset contains 515 LO instances, each
characterized by 1336 attributes. From these, we selected the
1000, 750, 500, 250, 150, 100 and 50 highest-ranked attributes
to create different datasets. Next, we applied the 13
MLC algorithms to each version of the dataset in order
to find out whether there are differences in computational cost and
performance according to several evaluation measures. In
addition to training time, the following five multi-label evaluation
measures were computed: a) example-based measures:
Hamming loss (H-loss) and Accuracy (E-Acc); b) label-based
measures: Accuracy (L-Acc); and c) ranking-based measures:
Ranking loss (R-loss) and Average precision (A-Pre). On the one
hand, we found a significant reduction in computational
cost as the number of features decreases (Figure 2), especially down
to 250 features. The algorithms whose training time is reduced the
most are ECC, RAKEL and EPS.


Figure 2. Training time (milliseconds).
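

As an illustration of the two example-based measures named above, the
sketch below computes Hamming loss and example-based accuracy from
binary label matrices. It uses the usual textbook definitions
(fraction of mismatched label slots, and mean Jaccard overlap per
example); MULAN's implementation may differ in edge cases, and the
function names are hypothetical.

```python
def hamming_loss(true_labels, predicted_labels):
    """Fraction of (example, label) slots that are predicted wrongly."""
    n_examples, n_labels = len(true_labels), len(true_labels[0])
    mismatches = sum(t != p
                     for ty, py in zip(true_labels, predicted_labels)
                     for t, p in zip(ty, py))
    return mismatches / (n_examples * n_labels)


def example_accuracy(true_labels, predicted_labels):
    """Mean over examples of |true AND predicted| / |true OR predicted|."""
    total = 0.0
    for ty, py in zip(true_labels, predicted_labels):
        inter = sum(1 for t, p in zip(ty, py) if t and p)
        union = sum(1 for t, p in zip(ty, py) if t or p)
        # ASSUMPTION: an example with no true and no predicted label scores 1.
        total += inter / union if union else 1.0
    return total / len(true_labels)
```

Lower Hamming loss and higher accuracy are better, which matches the
direction used when comparing algorithms.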


On the other hand, in order to compare the classification
performance of the algorithms, a Friedman test was carried
out for each evaluation metric, considering the results for each
feature reduction level. Ranking values and p-values are detailed
in Table 1. These p-values (< 0.05) show significant differences
between reduction levels at the 95% confidence level. We
can also observe that for Ranking loss (R-loss) and Average
Precision (A-Pre), the best ranking value is obtained with 1000
features instead of the original 1336 features. In addition, a
meta-ranking (a ranking of the rankings) of the reduction levels was
built by performing another Friedman test. In this way we can evaluate
which number of features has the best overall performance across most
of the metrics.
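
The average rankings and the Friedman statistic can be sketched as
follows. The sketch assumes that each block holds the scores of one
algorithm across the reduction levels being compared, gives tied
scores their average rank, and applies no tie correction to the
statistic; the function names are hypothetical.

```python
def average_ranks(scores_by_block, higher_is_better=True):
    """Average rank of each treatment over blocks; rank 1 is the best score."""
    k = len(scores_by_block[0])
    totals = [0.0] * k
    for block in scores_by_block:
        order = sorted(range(k), key=lambda j: block[j], reverse=higher_is_better)
        i = 0
        while i < k:
            j = i
            while j + 1 < k and block[order[j + 1]] == block[order[i]]:
                j += 1
            shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
            for pos in range(i, j + 1):
                totals[order[pos]] += shared
            i = j + 1
    return [t / len(scores_by_block) for t in totals]


def friedman_statistic(avg_ranks, n_blocks):
    """Friedman chi-square statistic computed from average ranks."""
    k = len(avg_ranks)
    spread = sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    return 12 * n_blocks / (k * (k + 1)) * spread
```

The statistic is compared against a chi-square distribution with k - 1
degrees of freedom, where k is the number of reduction levels.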


The last column of Table 1 shows the resulting meta-rank. It is
interesting to see that the best ranking does not correspond to the
complete feature set. As the test detected significant differences
between reduction levels (p-value < 0.01), a Bonferroni-Dunn test
was performed. This test found that the algorithms performed
significantly worse with fewer than 250 attributes at the 95%
confidence level. So, we established 250 as the optimum reduction
level.


Table 1. Avg. rankings for all metrics and reduction levels. 


Finally, the 13 MLC algorithms were compared at the
optimum reduction level (250 features). The
goal was to identify which algorithm yields the best results on this
specific dataset considering the previous five evaluation metrics.
The algorithm with the best overall results across the five
evaluation measures (higher in E-Acc, L-Acc and A-Pre; lower in
H-Loss and R-Loss) was RAKEL. So, this algorithm will be used in
our proposed approach for recommending the categories to which
new LOs belong. In the future, we want to use more evaluation
measures, as well as information about LO usage, to try to
improve classification performance.


ABSTRACT


Over the past 50 years, an increasing proportion of students
graduating from high school have attended college, but literacy
levels in the United States have remained largely unchanged. We
present preliminary results that suggest the literacy levels 
of assessed first year college freshmen are above 5th grade 
but below 12th grade, that only 32% of these freshmen are 
reading at a 12th grade level, and that this high-performing 
group has only a 69% chance of passing the reading portion 
of the GED high school equivalence test. 


Keywords 
adult literacy, higher education, NAEP, TABE 


1. INTRODUCTION


The percentage of high school graduates immediately at- 
tending college has steadily increased from 60% in 1990 [5] 
to 69% in 2015 [2]. However, during this same period the av- 
erage reading score of 12th grade students on the National 
Assessment of Educational Progress (NAEP) has declined 
slightly, such that in 2015, only 37% of students were deemed 
proficient readers [6]. If all proficient readers immediately 
attend college, then only 54% of college freshmen are pro- 
ficient readers. Accordingly, the remaining 46% of college 
freshmen are either basic or below basic readers. 
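
The 54% figure follows from dividing the share of proficient 12th
graders by the share of graduates who go straight to college. A
minimal check of that arithmetic, assuming (as the text does) that
every proficient reader enrolls; the function name is hypothetical:

```python
def proficient_share_of_freshmen(proficient_rate, enrollment_rate):
    """Upper bound on the share of freshmen who are proficient readers."""
    # If all proficient readers enroll, they make up proficient / enrolled.
    return min(1.0, proficient_rate / enrollment_rate)


share = proficient_share_of_freshmen(0.37, 0.69)
print(f"{share:.0%} proficient, {1 - share:.0%} basic or below basic")
```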


While it is alarming to think that approximately half of 
college freshmen are not proficient readers, the NAEP pro- 
ficiency criteria and cut scores are not without controversy 
[1]. For example, in a recent mapping of NAEP standards 
to state standards for 8th grade reading (the highest grade 
available), only one state was found to have standards aligned 
with NAEP’s proficient category. Given the controversy, it 
is not clear if the NAEP standards are too high or the state 
standards are too low. 


To better understand the relationship between NAEP reading
scores and college freshmen reading ability, we conducted
a pilot study using questions from the Reading section of the 
Tests of Adult Basic Education (TABE). The TABE [3, 4] 
is useful for exploring the question of reading proficiency of 
college freshmen because i) TABE items have national norms 
and are aligned with grade equivalences, allowing us to cate- 
gorize freshmen reading ability according to grade level and 
ii) TABE can be used to predict General Educational De- 
velopment (GED) test performance, which is a proxy for 
determining whether a participant’s reading ability is high 
school equivalent. 


2. METHOD 


2.1 Participants 

Participants (N = 1062) were recruited through the psy- 
chology subject pool at an urban university in the southern 
United States in two waves of online data collection. The 
first wave (N = 313), which took place during the spring 
semester of 2015, was conducted as a regular online study, 
but the second wave (N = 749), which took place during 
the fall semester of 2015, was conducted as a screening com- 
ponent for the entire subject pool. Subject pool screen- 
ing is used to determine eligibility for other studies later 
in the semester and therefore represents an even more di- 
verse group of participants, as it largely eliminates the self- 
selection bias of experimental sign up. No demographics of 
participants were collected. 


2.2 Materials 

Ten items (#4-13) were selected from the nationally normed
TABE 10 Form D Reading Survey. Form D (Difficult) is designed
to assess reading ability in grade ranges 6.0 - 8.9 and
therefore may seem a less obvious choice for assessing college
freshmen. However, Form D items cover the widest range
of grade equivalents (grades .7 - 12.9) of all TABE 10 forms
and therefore have some additional utility when the underlying
grade level is unknown. Because the 10 items used in the
present study were selected from the 25-item TABE 10 Form
D Reading Survey, the distribution of grade equivalents for
the items does not match the distribution of the complete survey
and instead falls into three clusters: five items are at grades
4-5 (3.9, 4.4, 4.8, 5.1, and 5.2), three items are at grades
11-13 (11.4, 12, and 12.9), and two items are at grades 6-7
(6.2 and 7). All items had a multiple-choice format with four
response options.


2.3 Procedure 

Participants completed the informed consent and the 10 
items using a web browser. Because the study was online 
and not proctored, the time guidelines of the TABE (ap- 
proximately 1 minute per question) were not enforced, and 
due to technical problems, the time participants spent on 
the items could not be determined. Participants read each 
of three text passages in turn and answered three to four 
items after each passage by selecting a multiple-choice re- 
sponse option. 


3. RESULTS 


Overall, 75% of participants answered 80% or more items 
correctly, suggesting that the 10 items were overall too easy, 
as recommendations for TABE specify that participants an- 
swer 40% to 75% of the items correctly [4]. Participant 
performance varied across item difficulty cluster, however. 
While 73% of participants answered all five items correctly 
in the 4-5th grade cluster, only 32% answered all three items 
correctly in the 11-13th grade cluster. Furthermore, 30% of
participants answered one item or fewer correctly in the
11-13th grade cluster. Using the TABE guidelines above, this
differential cluster performance suggests that 4-5th grade 
items are too easy but that 11-13th grade items are too 
hard for the participants assessed. 


These results may also be considered in terms of scale scores 
and GED equivalence. According to previous work mapping 
TABE Reading scale scores to GED Reading test scores [3], 
a TABE scale score of 523 corresponds to the passing GED 
score of 450. Scale scores for each item cluster and items 
overall were calculated and compared to the GED crite- 
rion. Only participants who answered all 10 items correctly 
(248 participants) or all of the 11-13th grade items correctly 
(335 participants) surpassed the GED criterion. Using the 
TABE-GED mapping [3], participants who answered all of 
the 11-13th grade items correctly had a 69% chance of pass- 
ing the GED Reading test. Thus while 32% of all partici- 
pants answered the 11-13th grade items correctly, only 22% 
of all participants are likely to pass the GED Reading test. 
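
The 32% and 22% figures can be checked from the counts reported above.
The sketch below only reproduces the arithmetic; the variable names
are illustrative.

```python
n_participants = 1062        # total sample size
n_all_hard_correct = 335     # answered all 11-13th grade items correctly
ged_pass_probability = 0.69  # from the TABE-GED mapping [3]

# Share of the sample in the top-performing group (about 32%).
share_all_hard_correct = n_all_hard_correct / n_participants

# Expected share of the whole sample passing the GED Reading test (about 22%).
expected_ged_pass_share = share_all_hard_correct * ged_pass_probability

print(f"{share_all_hard_correct:.0%} -> {expected_ged_pass_share:.0%}")
```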


4. DISCUSSION 

Our preliminary results suggest that college freshmen read- 
ing ability overall is between 5th and 12th grade. This find- 
ing is plausible given NAEP results that only 37% of 12th 
grade students are proficient readers [6]. The lack of a more 
specific grade-level assessment of freshmen reading ability is 
attributable to the 10-item assessment used, which lacked 
medium difficulty items. In the present study, the duration 
of the complete 25 item TABE Survey was beyond what 
could be accommodated logistically; however, our results 
indicate that such logistic considerations must be overcome 
to assess the reading ability of college freshmen adequately. 


Analysis of the 11-13th grade cluster offers suggestive results 
regarding freshmen reading ability, but must be treated with 
caution given that there were only three items in this clus- 
ter. Participants who answered all three items in this cluster 
correctly could reasonably be assumed to be proficient read- 
ers, and the difference between this percentage (32%) and 
NAEP’s percentage of proficient readers (37%) could be easily
explained by regional differences. Although demographic
data was not collected for this study, the freshman demo- 
graphics for the university where the study was conducted 
suggest that approximately half of students are white and 
half are African-American. These two groups have NAEP 
12th grade Reading Proficiency rates of 46% and 17% re- 
spectively, averaging 32% as found in the present study. 


However, as previously noted, only 69% of graduating se- 
niors went straight to college in 2015 [2], suggesting that 
54% of college freshmen should be proficient readers, assum- 
ing that all NAEP Proficient readers attend college. The 
present finding that reading proficiency is closer to the high 
school rate than the projected college rate could reflect a self- 
selection effect whereby the most proficient readers attend 
schools with more stringent admissions criteria on standard- 
ized tests. 


The projection that only 69% of participants who answered 
all three items in the 11-13th grade cluster would pass the 
GED Reading test gives a strikingly different assessment of 
freshman reading proficiency (22% vs. NAEP’s 37%) that 
cannot be easily explained by regional differences and may 
be a useful target for future research. 


Altogether, our findings suggest that two-thirds of the college
freshmen assessed have reading ability below the Proficient
level described by NAEP. More accurate assessment and
determination of regional differences are important areas of
future research, as reading proficiency plays
a large role in college success.


ABSTRACT


Identifying prerequisite relationships among skills is important for 
better student modeling in many educational systems. In this paper, 
we propose a new method to discover prerequisite structure from 
data using nested model comparisons in the context of Bayesian 
estimation. We evaluate our method with simulated data and real 
math test data. 


Keywords 


Prerequisite structure discovery, Bayesian Network, MCMC 
estimation, nested model comparison, pseudo-Bayes factor. 


1. INTRODUCTION 


In many educational systems, the process of learning usually 
proceeds sequentially according to a predetermined order that 
reflects cognitive theories about student learning. In this learning 
sequence some knowledge skills must be acquired prior to learning 
advanced skills. In this study, we refer to prerequisite structure as 
the relationships among skills that put strict constraints on the order 
in which these skills can be mastered. 


Identifying the skill prerequisite structure is a crucial step in
constructing a valid and accurate student model in an adaptive
tutoring system or other educational system, both for estimating a
student's skill mastery status and for providing appropriate
remediation. Prerequisite structure can be specified by domain
experts, but such a process may be time-consuming and could produce
subjective models lacking validity. Using large educational datasets
and data mining techniques, several previous studies have tried to
find prerequisite relationships among knowledge skills [1,2,3,7].
Deriving prerequisite structure from student performance data is
challenging because a student’s mastery status of skills
cannot be directly observed but only estimated, i.e., it is latent
in nature. Previous work mostly used Expectation-Maximization
(EM) estimates for latent skill variables [1,2,3].


In this paper, we present a new method for discovering prerequisite 
structure from student performance data using Bayesian Markov 
Chain Monte Carlo (MCMC) estimation and nested model 
comparison. For nested model comparison, we use pseudo-Bayes 
factor (PSBF) [4], one of the Bayesian model selection criteria. 


2. METHOD 


In our method, it is assumed that student performance (item
response) data at a certain point in time is given and that the skills
related to each item are specified. Skills and items are treated as
binary random variables, and the item-skill relationships are given
by a Q-matrix (a binary matrix that represents the mapping of items
to skills) [9]. The DINA model is used to model the probability of a
correct response to an item as a function of whether all the required
skills are mastered and of slip and guess parameters [5]. To
represent the skill prerequisite structure, a (static) Bayesian
network is used as the student model. A Bayesian network is a
probabilistic graphical model representing the relationships among
a set of random variables as a directed acyclic graph (DAG) with
conditional probability tables (CPTs).
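As a minimal sketch of the DINA response model described above (not the authors' implementation, which relied on OpenBUGS), the probability of a correct response is 1 - slip when the student has mastered every skill that the Q-matrix assigns to the item, and guess otherwise. The function and argument names are illustrative assumptions.

```python
def dina_prob(alpha, q_row, slip, guess):
    """P(correct response) for one student on one item under DINA.

    alpha -- list of 0/1 skill mastery indicators for the student
    q_row -- list of 0/1 flags: the item's row of the Q-matrix
    slip  -- P(wrong answer | all required skills mastered)
    guess -- P(right answer | at least one required skill missing)
    """
    # eta is True only if every skill required by the item is mastered
    eta = all(a == 1 for a, q in zip(alpha, q_row) if q == 1)
    return 1.0 - slip if eta else guess


# Hypothetical two-skill example: the item requires both skills.
p_master = dina_prob([1, 1], [1, 1], slip=0.1, guess=0.2)
p_non_master = dina_prob([1, 0], [1, 1], slip=0.1, guess=0.2)
```

Because the model is conjunctive, missing a single required skill is enough to drop the success probability to the guess parameter.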


We now focus on the discovery of a prerequisite relationship, that
is, a strict hierarchical order between the mastery of two skills. To
this end, we set up two types of models: a full model, which
parameterizes all possible dependencies between skills, and a strict
model, which assumes a prerequisite relationship between a pair of
skills. For example, Figure 1 illustrates the DAGs and CPTs of a full
model consisting of three skills ($S_1$, $S_2$, $S_3$) and a strict
model assuming a prerequisite relationship between skills $S_1$ and
$S_2$ ($S_1$ is a prerequisite for $S_2$). The difference between
the two models is that, while the full model contains the parameter
$\gamma_{20}$ for the probability $P(S_2 = 1 \mid S_1 = 0)$, the
strict model constrains this probability to zero (that is, the strict
model is nested within the full model). If skill $S_1$ is a true
prerequisite for $S_2$, the parameter $\gamma_{20}$ in the full model
will be estimated to be close to zero and there will be no
significant difference in the degree to which the two models explain
the data. The idea of nested model comparison is to statistically
test the null hypothesis that the two models yield the same
likelihood on the data.


Figure 1. DAGs and CPTs of (a) a full model and (b) a strict
model of skills $S_1$, $S_2$, $S_3$. The bolded directed edge from
$S_1$ to $S_2$ in the DAG of the strict model (b) means that $S_1$
is a prerequisite for mastery of $S_2$.


When two models are fitted to the data using maximum likelihood,
the likelihood ratio test is used for hypothesis testing. In the context
of Bayesian estimation, the Bayes factor or its variants can be
considered as the test method. We use the pseudo-Bayes factor,
which can be calculated within the MCMC estimation process, as the
test statistic to contrast two models. The pseudo-Bayes factor for
model $M_1$ relative to $M_2$ is the ratio of approximations of the
marginal likelihood based on predictive distributions and
cross-validation strategies, and is defined as


$$
\frac{\hat{p}(X \mid M_1)}{\hat{p}(X \mid M_2)}
= \frac{\prod_{i} p(X_i \mid X_{-i}, M_1)}{\prod_{i} p(X_i \mid X_{-i}, M_2)}
= \frac{\prod_{i} \int p(X_i \mid \Theta, M_1)\, p(\Theta \mid X_{-i}, M_1)\, d\Theta}
       {\prod_{i} \int p(X_i \mid \Theta, M_2)\, p(\Theta \mid X_{-i}, M_2)\, d\Theta}
$$

where $X_i$ is the response data of student $i$, $X_{-i}$ is the
complement of $X_i$ in the data $X$, and $\Theta$ is the set of free
parameters. The calculated PsBF value in MCMC estimation is
compared to a critical value to decide whether to reject the null
hypothesis or not.
If the null hypothesis is not rejected, then the strict model is 
accepted, thus concluding that the prerequisite relationship exists. 
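The leave-one-out predictive densities in the definition above are commonly approximated from a single MCMC run through the conditional predictive ordinate (CPO), using the harmonic-mean identity CPO_i = 1 / mean_s[1 / p(X_i | Theta_s, M)] over posterior draws Theta_s. The sketch below assumes that estimator and assumes students are conditionally independent given the parameters; the source does not state which estimator was actually used, and the function names are hypothetical.

```python
import math


def log_cpo(loglik_draws):
    """Log CPO for one student.

    loglik_draws -- log p(X_i | Theta_s, M) for each posterior draw Theta_s
    """
    neg = [-v for v in loglik_draws]
    m = max(neg)  # shift for a numerically stable log-sum-exp
    log_mean_inverse = m + math.log(sum(math.exp(v - m) for v in neg) / len(neg))
    return -log_mean_inverse


def log_psbf(loglik_m1, loglik_m2):
    """Log pseudo-Bayes factor of model 1 relative to model 2.

    Each argument holds one list of per-draw log-likelihoods per student.
    """
    return sum(log_cpo(d) for d in loglik_m1) - sum(log_cpo(d) for d in loglik_m2)


# Toy input: 3 students, 4 draws each, constant likelihoods per model.
m1 = [[math.log(0.5)] * 4 for _ in range(3)]
m2 = [[math.log(0.25)] * 4 for _ in range(3)]
value = log_psbf(m1, m2)
```

Working on the log scale avoids underflow when the product runs over many students.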


3. EVALUATIONS 


To evaluate how well our method discovers prerequisite structures,
we first conducted a simulation study and then applied our method
to a real dataset. In this process we faced the problem that PsBF
values deviate from the known distribution of the Bayes factor [6].
To address this problem, we derived the critical value from the
empirical distribution of PsBF values under the null hypothesis.


In our evaluation steps, all MCMC estimation algorithms were
implemented using the R package R2OpenBUGS [8]. For MCMC
estimation, we set the priors as follows: a uniform prior Unif(0, 1)
on each structural parameter ($\gamma_{ij}$) and a beta prior
Beta(6, 21) on the slip and guess parameters of each item.


3.1 Simulated Data 


In the simulation study, we considered five prerequisite structures
of latent skills (Figure 2). For each structure, we generated 500
datasets consisting of 1000 students' skill mastery statuses and their
responses to test items, using a balanced Q-matrix (each skill is
measured with the same number and types of items) under the
DINA model with low slip and guess probabilities randomly drawn
from Unif(0, 0.05).
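The data-generating step can be sketched as follows for the simplest case of two skills where $S_1$ is a strict prerequisite for $S_2$. The mastery probabilities (0.6 and 0.7), the number of items per skill, and the use of single-skill items only are assumptions made for illustration; the source does not report these settings.

```python
import random


def simulate_dataset(n_students=1000, items_per_skill=4, seed=0):
    """Simulate skills and DINA responses for the structure S1 -> S2."""
    rng = random.Random(seed)
    # Balanced Q-matrix: each skill is measured by the same number of items.
    q_matrix = [[1, 0]] * items_per_skill + [[0, 1]] * items_per_skill
    slips = [rng.uniform(0, 0.05) for _ in q_matrix]
    guesses = [rng.uniform(0, 0.05) for _ in q_matrix]
    skills, responses = [], []
    for _ in range(n_students):
        s1 = int(rng.random() < 0.6)  # assumed P(S1 = 1)
        # Strict prerequisite: P(S2 = 1 | S1 = 0) = 0
        s2 = int(s1 == 1 and rng.random() < 0.7)  # assumed P(S2 = 1 | S1 = 1)
        alpha = [s1, s2]
        row = []
        for q, slip, guess in zip(q_matrix, slips, guesses):
            mastered = all(a == 1 for a, need in zip(alpha, q) if need == 1)
            p_correct = 1 - slip if mastered else guess
            row.append(int(rng.random() < p_correct))
        skills.append(alpha)
        responses.append(row)
    return skills, responses


skills, responses = simulate_dataset(n_students=200)
```

Repeating the call with different seeds would give the replicated datasets over which recovery rates are computed.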


Figure 2. Five prerequisite structures of skills used in the
simulation study: (a) Structure 1, (b) Structure 2, (c) Structure 3,
(d) Structure 4, (e) Structure 5


We evaluate our method using two metrics: true positive structure 
rate (TPSR; # of correct structure recoveries in the output / # of true 
structures) and true positive adjacency rate (TPAR; # of correct 
adjacency recoveries in the output / # of adjacencies in true model). 
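One way to read these two metrics, sketched under the assumption that each recovered structure is represented as a set of directed (prerequisite, dependent) pairs and that rates are pooled over the simulated datasets. The function names and the pooling are illustrative assumptions rather than the authors' exact procedure.

```python
def tpsr(recovered_structures, true_structure):
    """Share of datasets whose recovered edge set equals the true edge set."""
    true_edges = set(true_structure)
    hits = sum(1 for edges in recovered_structures if set(edges) == true_edges)
    return hits / len(recovered_structures)


def tpar(recovered_structures, true_structure):
    """Share of true adjacencies that were recovered, pooled over datasets."""
    true_edges = set(true_structure)
    found = sum(len(true_edges & set(edges)) for edges in recovered_structures)
    return found / (len(true_edges) * len(recovered_structures))


# Hypothetical outputs for two simulated datasets.
truth = [("S1", "S2"), ("S2", "S3")]
outputs = [[("S1", "S2"), ("S2", "S3")], [("S1", "S2")]]
structure_rate = tpsr(outputs, truth)
adjacency_rate = tpar(outputs, truth)
```

TPAR can exceed TPSR because a structure counts as recovered only when every edge is right, while each correct edge contributes to TPAR on its own.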


The results show that our method can discover prerequisite
structure reliably (Table 1). In all cases the recovery rate of the
true structure is over 80% (the worst rate is 81.6%, in structure 4).
The recovery rates of true prerequisite relationships between two
skills (edges) are even higher, at over 90%.


Table 1. TPSR and TPAR results for each structure 


3.2 Real Data Application 


We used mathematics cognitive diagnosis assessment data from
936 eighth-grade students on a set of 16 items measuring four
skills related to linear equations and linear inequalities (Figure 3-a).
The prerequisite structure of these skills (Figure 3-b) was initially
set by knowledge experts.


Figure 3-c shows the prerequisite structure discovered by applying
our method to the real data. All prerequisite relationships set by
experts are recovered, and one additional prerequisite relationship
($S_2 \to S_3$) is found. A possible explanation for this is that,
while knowledge experts judge that either linear equations or
linear inequalities can be learned first, students usually learn to
solve linear equations first, following the sequence in the
curriculum.


Figure 3. (a) Four skills in math test: basic arithmetic operations;
solving linear equation; solving linear inequality; real world
problem solving using linear equation and inequality; (b)
Prerequisite structure from knowledge experts; (c) Discovered
prerequisite structure


4. CONCLUSION AND FUTURE WORK 


We presented a method to discover skill prerequisite structure from
data based on nested model comparison and evaluated the method
using simulated data and real data. The performance of our
prerequisite structure learning method was good within the settings
used in our experiments. Since we used only a small number of
skills and certain assumptions for the evaluation, we need to
explore our method further under various conditions.


In future work, we will investigate the idea of nested model
comparison in the context of frequentist estimation (e.g., EM
estimation) and compare it with previous methods. In this paper
the focus is only on the prerequisite relationship between skills, but
there may be other dependence relationships between them along
with different types of response models. It would be interesting to
study how to discover skill structures considering various
dependency relationships in Bayesian network modeling of skill
mastery.


ABSTRACT


To date, most MOOCs on major platforms (e.g. Coursera and edX)
are xMOOCs, which means teacher speech is still the major part
of these MOOCs. Therefore, it is necessary to evaluate the quality
of lectures and to explore the relationships between lecture quality
in MOOCs and learning outcomes. The present study attempted to
explore the lecture styles of instructors in MOOCs by using text
analysis. One hundred and twenty-nine course transcripts were
collected from Coursera and edX. We also collected public course
evaluation data from the largest MOOC community in China
(mooc.guokr.com). Linguistic Inquiry and Word Count (LIWC) and
Coh-Metrix were used to extract text features including
self-reference, tone, affect, cognitive words, and cohesion. After
combining students’ comments with clustering analysis, results
indicated that four different lecture styles emerged from the 129
courses: “mediocre”, “boring”, “perfect” and “enthusiastic”.
A significant difference was found between the four lecture styles
for notes taken, but no significant differences were found for
course satisfaction or discussion posts. Future studies should
examine whether different lecture styles have impacts on students’
engagement and learning outcomes in MOOCs.


Keywords 


MOOCs; Lecture styles; Instructors; Text analysis 


1. INTRODUCTION


Massive open online courses (MOOCs) have attracted much
attention in recent years. They provide not only free courses
from prestigious universities, but also the freedom of learning
for learners all over the world. Major MOOC platforms, such as
Coursera, FutureLearn, edX, and Open2Study, are well received
by most learners. MOOCs have become a popular way to learn
because they provide each individual learner with opportunities
to engage with the materials via formative assessments and the
ability to personalize their learning environment (Evans, Baker &
Dee, 2016).


Researchers from different disciplines have conducted many
studies focused on MOOC learners, including course completion,
quality of interaction, student engagement, and collaborative
learning in MOOCs (Andres et al., in press; Wang & Baker, 2015).


However, the complexities of teaching have been largely absent
from emerging MOOC debates (Ross et al., 2014). After all, a
MOOC is quite different from a traditional class in many respects.
For example, MOOC instructors were motivated by a sense of
intrigue, the desire to gain some personal rewards, or a sense of
altruism; they were challenged by difficulty in evaluating students’
work, a lack of student participation in online forums, heavy
demands on time and money, and a sense of speaking into a
“vacuum” due to the absence of immediate student feedback
(Hew & Cheung, 2014). Some instructors found it difficult to teach
when not facing a real audience of students (Allon, 2012). To date,
most MOOCs on major platforms (e.g. Coursera and edX) are
xMOOCs, that is, highly structured, content-driven courses
designed for large numbers of individuals working mostly alone,
and teacher speech is still the major part of these MOOCs.
Therefore, it is necessary to evaluate the quality of lectures and to
explore the relationships between lecture quality in MOOCs and
learning outcomes. Some researchers have tried to build models
that use natural language processing to automatically predict
whether certain course content would show up (Araya et al.,
2012). Based on the above, the present study attempted to explore
the lecture styles of instructors in MOOCs by using text analysis.


2. METHOD
2.1 Data Collection 


Transcripts from 129 courses (humanities: 24.8%, social science:
38%, science: 37.2%) were collected from Coursera and edX. We
also collected public course evaluation data from the largest
MOOC community in Mainland China (mooc.guokr.com). This
community offers online learners a platform on which they can
voluntarily evaluate MOOCs and share their opinions with fellow
online learners. The data set we used included, among other
variables, course satisfaction, the number of asynchronous
discussion posts per course, notes taken per course, and the
number of followers per course.


2.2 Extracting Text Features 

Two text analysis tools (i.e. LIWC and Coh-Metrix) were used to
extract text features from the 129 course transcripts. Following
previous studies, self-reference (I, me, my), affect (positive
emotion and negative emotion), tone, cognitive words, and
cohesion were extracted. Other features, such as words per
sentence and big-words (words longer than six letters), were also
used as complexity measures of teacher speech.


2.3 Data Analysis

Clustering analysis and ANOVA were conducted using
RapidMiner and SPSS. We first transformed all the text features
into z-scores, then ran the k-means algorithm with Euclidean
distance in RapidMiner. We varied k from 2 to 6 to keep the
resulting clusters interpretable.


Figure 1. The four lecture styles in MOOCs


3. RESULTS AND DISCUSSION


Four clusters were found, containing 42, 27, 36, and 24 courses
respectively. We then checked the students’ comments on these
courses in the Guoke MOOC community and assigned a label to
each cluster (Figure 1).
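The clustering step described in Section 2.3 (z-score transformation followed by k-means with Euclidean distance) can be sketched in plain Python as below. The study itself used RapidMiner; this stand-in uses a basic Lloyd iteration with random initial centers, and the function names, seed, and toy data are assumptions.

```python
import random
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each feature (column) to mean 0 and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant features
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]


def kmeans(rows, k, n_iter=100, seed=0):
    """Basic k-means with squared Euclidean distance."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(r, centers[j])))
            for r in rows
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the previous center if a cluster empties
                centers[j] = [mean(c) for c in zip(*members)]
    return labels, centers


# Toy feature matrix: four transcripts, two text features.
features = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centers = kmeans(zscore_columns(features), k=2)
```

Running the same call for each k from 2 to 6 and inspecting the resulting groups mirrors the model-selection approach described in the text.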


Concretely, instructors who used the most self-reference words (I,
me, my), short sentences, and the fewest big-words were perceived
as agreeable and enthusiastic by students (Cluster 4: Enthusiastic).
Instructors who used the fewest self-reference words, long
sentences, the most big-words, and showed low cohesion were
perceived as boring by students (Cluster 2: Boring). Instructors
who used the most cognitive words to help students understand,
used a medium level of self-reference words and big-words, and
showed medium cohesion were labeled as “perfect” (Cluster 3).
Courses that used the most long sentences and showed an average
level on the other dimensions were labeled as “mediocre”
(Cluster 1). No significant differences were found between the four
lecture styles for course satisfaction (F = .76, p = .52, η² = .02) or
discussion posts (F = 1.39, p = .25, η² = .03). However, a
significant difference was found for notes taken (F = 2.80, p = .04,
η² = .06). Concretely, the number of notes taken in the “perfect”
style was much higher than in the “mediocre” style. Notes taken
can stand for the cognitive processing of learners to some extent.
These results suggest that the “perfect” lecture style may be more
likely to encourage students’ engagement. Since the discussion
posts, notes taken and course satisfaction data in the present study
were acquired from a third-party platform, further evidence is
needed to verify these results. Future studies should examine
whether the four lecture styles have different impacts on students’
engagement and learning outcomes (e.g. academic performance
and course completion) in MOOCs.


ABSTRACT


We present the process of categorization of students’ questions, 
and through a clustering on students, we show the relevance of 
this classification to identify different profiles of students. It 
opens perspectives in assisting teachers during Q&A sessions. 


Keywords 


Clustering, question taxonomy, students’ behavior. 


1. INTRODUCTION 


Studying learners’ questions while they learn is essential [1], not
only to understand their level and eventually help them learn
better [2] but also to help teachers address these questions.
Analyzing students’ questions can help, for instance, in
distinguishing deep learning from shallow learning [3]. In this
paper, we are interested in whether the type of questions asked by
students on an online platform is characteristic of their classroom
behavior. We investigate this question in the context of a hybrid
curriculum (like [4]), where students have to ask questions before
the class to help professors prepare their Q&A session. Our goal
here is threefold: (RQ1) Can we define a taxonomy of questions
relevant to analyze students’ questions? (RQ2) Can we automate
the identification of these questions? (RQ3) Can annotated
questions asked by a student inform us about their performance,
attendance and questioning behavior?


2. RESEARCH METHODOLOGY 


We addressed these research questions in 3 successive steps:
(1) we conducted a manual process of categorization of students’
questions, which allowed us to propose a taxonomy of questions,
(2) we used this taxonomy for an automatic annotation of a corpus
of students’ questions, (3) to identify students’ characteristics
from the typology of questions they asked, we applied a clustering
technique to two courses and then characterized the obtained
clusters using a different set of features, as in [5].


The dataset used for this work is made of questions asked in 2012
by 1st-year medicine/pharmacy students from a major public
French university (Univ. Joseph Fourier). Each course is made of
4 to 6 four-week sequences on the PACES¹ platform. After a 1st
week dedicated to learning from online material, during week 2
students must ask questions and vote for questions asked by other
students on an online forum to help professors prepare their Q&A
session in week 3. Therefore, for each of the 13 courses, we have
4 to 6 sets of questions asked by students (6457 questions overall)
during the 2nd week of each sequence.


¹ paces.medatice-grenoble.fr


francois.bouchet@lip6.fr 


vanda.luengo@lip6.fr 
3. RESULTS 


3.1 Categorization of questions 

To answer RQ1, we took a sample of 600 questions (around
10% of the corpus size) from two courses (biochemistry [BCH],
histology & developmental biology [HBDD]), which are
considered to be among the most difficult courses and had the
highest number of questions asked. This sample was randomly
divided into 3 sub-samples of 200 questions to apply 3 different
categorization steps: a discovery step, a consolidation step and a
validation step. Step 1 consisted of grouping sentences with
similarities to extract significant concepts. Then we segmented the
combined questions to standardize the previous annotation and we
grouped the extracted categories into independent dimensions,
where each dimension grouped similar concepts in sub-categories.
Step 2 consisted of annotating the second sub-sample to validate
the dimensions previously identified and to make sure they were
indeed independent from each other. In step 3, we performed a
double annotation to validate the generality of our categories on
the remaining sub-sample of 200 sentences. Two human
annotators used the previously created taxonomy as their sole
reference. They annotated each dimension independently (average
kappa = 0.70); discussions to resolve discrepancies led to a final
refinement of the categories’ descriptions. Finally, a re-annotation
was performed on the entire sample (600 sentences) to take the
changes into account and to provide a ground truth for the
automatic annotation. The final taxonomy is provided in Table 1.


Table 1. Final question taxonomy from manual annotation 


Value | Label | Description

Type questions
  |  | Ask for an explanation already done in the course material.
2 | Deepen a concept | Broaden a knowledge, clarify an ambiguity or request for a better understanding
  | Validation / verification | Verify/validate a formulated hypothesis

Modality explanation
  | N/A | None: attributed when neither of the other values below applies
1 | Example | Example application (course/exercise)
  | Schema | Schema application or an explanation about it
  | Correction | Correction exercise

  | N/A | None: attributed when neither of the other values below applies
  | Roles (utility?) | What’s the use / function
  | Link between concepts | Verify a link between two concepts

1 | Mistake / contradiction | Detect mistake/contradiction in course or in teacher’s explanation.
  | Knowledge in course | Verify knowledge
  |  | Check exam-related information


3.2 Automatic annotation 

To answer RQ2 and to annotate the whole corpus (and, in the
long term, to use the annotator online to analyze the questions
collected), we identified keywords representative of each value in
each dimension (e.g. the word “detail” is representative of a
“deepen a concept” question). Then we developed an automatic
tagger which identifies, for each question, the main value
associated with each dimension and tags the question accordingly.
We validated the automatic annotator by comparing its results
with the manually annotated subsample of 600 questions and
obtained a kappa value of 0.74, enough to consider applying it to
the full corpus.
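A keyword-based tagger for a single dimension, together with the Cohen's kappa used to compare two sets of annotations, can be sketched as follows. Only the pairing of "detail" with "deepen a concept" comes from the text; the other keywords, the default value, and the first-match rule are hypothetical, and the authors' tagger handles every dimension.

```python
from collections import Counter

# Hypothetical keyword lists for one dimension.
KEYWORDS = {
    "deepen a concept": ["detail"],
    "validation / verification": ["is it correct", "is it true"],
}


def tag_question(text, keywords=KEYWORDS, default="N/A"):
    """Return the first value whose keyword occurs in the question."""
    lowered = text.lower()
    for value, words in keywords.items():
        if any(w in lowered for w in words):
            return value
    return default


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotations of the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) when both annotators use a single label.
    return (observed - expected) / (1 - expected)


auto = [tag_question(q) for q in ["Could you detail the Krebs cycle?", "When is the exam?"]]
```

Comparing the tagger's output with the manual labels through `cohen_kappa` gives the kind of agreement figure reported above.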


3.3 Links between questions and behavior

To identify whether the type of questions asked can inform us on
students’ characteristics, we first performed two clustering
analyses using the K-Means algorithm (with k varying between 2
and 10) over two datasets: students who asked questions in the
BCH course (1227 questions asked by N1=244 students) and in the
HBDD course (979 questions asked by N2=201 students). We
performed the clustering using as features for each student the
proportion of questions with each value in each dimension (e.g.
the proportion of questions with value 1 in dimension 1), computed
(a) overall, (b) during the first half of the course, and (c) during the
second half of the course (44 features overall). Distinguishing (b)
and (c) in addition to (a) allowed us to capture cases where a
change in the questions asked, rather than their overall
distribution, was meaningful. We obtained 4 clusters in both cases.
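The per-student feature vector described above can be sketched as follows, assuming each question is stored with the half of the course in which it was asked and a mapping from dimension to annotated value. This data layout and the function name are assumptions made for illustration.

```python
from collections import Counter


def proportion_features(questions, categories):
    """Build (a) overall, (b) first-half and (c) second-half proportions.

    questions  -- list of (half, annotation): half is 1 or 2, annotation is
                  a dict mapping dimension name to the annotated value
    categories -- ordered list of (dimension, value) pairs to report
    """
    def proportions(subset):
        counts = Counter((d, v) for _, ann in subset for d, v in ann.items())
        n = len(subset)
        return [counts[c] / n if n else 0.0 for c in categories]

    first = [q for q in questions if q[0] == 1]
    second = [q for q in questions if q[0] == 2]
    return proportions(questions) + proportions(first) + proportions(second)


# Hypothetical student with three questions annotated on one dimension.
cats = [("dim1", 1), ("dim1", 2)]
student = [(1, {"dim1": 1}), (2, {"dim1": 2}), (2, {"dim1": 2})]
feats = proportion_features(student, cats)
```

Stacking one such vector per student yields the matrix on which the clustering is run.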


The second step consisted of characterizing the clusters by
considering attributes not used for the clustering: students’ grade
on the final exam for this course (out of 20), attendance ratio
(from 0 [never there] to 1 [always there]), the number of questions
asked in this course, and the number of votes from other students
on their questions in this course. Students for whom this data was
not available were excluded from the datasets, leading to two
smaller sample sizes (N’1=173 and N’2=161). We performed two
one-way ANOVAs for grades on these two clusterings and found
statistically significant differences (p<0.001 and p<0.001). For the
other variables, the distribution was not normal and we therefore
performed a Kruskal-Wallis H test on the ranks associated with
each variable. The test showed that there was a statistically
significant difference for attendance (p=0.04 and p=0.02), number
of questions asked (p<0.001 and p<0.001) and number of votes
received (p=0.04 and p<0.001) for BCH and HBDD respectively.
Results are summarized in Table 2.
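For readers unfamiliar with the rank-based test, the Kruskal-Wallis H statistic can be computed as in the sketch below. It assigns average ranks to ties but omits the tie correction factor, and it does not compute the p-value, which is usually taken from a chi-square distribution with (number of groups - 1) degrees of freedom. The example values are made up.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (average ranks for ties, no tie correction)."""
    pooled = sorted((x, g) for g, values in enumerate(groups) for x in values)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j
        for _, g in pooled[i:j]:
            rank_sums[g] += avg_rank
        i = j
    total = sum(r * r / len(v) for r, v in zip(rank_sums, groups))
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Made-up attendance-like values for three clusters.
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

A large H indicates that at least one cluster tends to have higher or lower ranks than the others.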


Table 2. Differences between the 4 BCH and HBDD clusters 


[Table: clusters of the BCH and HBDD courses, with final exam
grade, attendance ratio, number of questions asked and number of
votes received per cluster]

4. DISCUSSION AND CONCLUSION 


Overall, when considering the results presented in Table 2, we see
two similar clusters in both cases: A and D. Cluster A is made of
around 28-41% of the students, with grades lower than average,
who attend classes less often and ask fewer questions than average,
but whose questions are particularly popular (probably because of
votes from similar students, but that information was unfortunately
not available). In terms of questions asked, they had a higher
number of “how to” questions (cf. dim3-2 in Table 1) than any
other cluster. At the other end of the spectrum, cluster D is made of
around 21% of the students, with grades above average and high
attendance, who ask more questions than average that are fairly
unpopular; we can assume these must be very precise questions
that already require a good understanding of the content of the
course, and are thus not deemed as important by other students.
Interestingly, when comparing the proportion of questions asked
in the first vs. second half of the class, cluster D students are the
only ones who asked more questions in the 2nd half of the 4-6
sequences than in the 1st half, presumably because the concepts
presented at the beginning were simpler and easier for them to
understand. In between, clusters B and C represent more average
students who differ mostly in terms of number of questions asked.


Therefore, to answer RQ3, we have shown that although the
clustering was performed exclusively on semantic features (cf.
taxonomy in Table 1), it correlates with information relative to
students’ performance, attendance and questioning/voting
behavior. Our work has some limitations: we have applied it to
only 2 courses (because a minimum number of questions is
required) and we have not considered whether it would be possible
to classify students into clusters online, or whether the same
clusters could be found in the same courses in different years.
Furthermore, not all questions could be automatically annotated,
which reduced the dataset size and is particularly problematic for
students who asked few questions. However, this work
demonstrates the validity and the usefulness of our taxonomy, and
shows the relevance of this classification to identify different
students’ profiles. It also suggests the taxonomy could be useful
for our long-term goal, which is to assist teachers in choosing
questions to be explained in Q&A sessions. We also intend to
apply this taxonomy to different datasets (e.g. questions asked in a
MOOC) to see if it can also be useful in these contexts and if
similar patterns appear.


ABSTRACT


Interactive Strategy Training for Active Reading and Thinking
(iSTART) is an intelligent tutoring system that supports reading
comprehension through self-explanation (SE) training. This
study tested how two metacognitive features, presented in a 2 x
2 design, affected students’ SE scores during training. The
performance notification feature notified students when their
average SE score dropped below an experimenter-set threshold.
The self-rating feature asked participants to rate their own SE
scores. Analyses of SE scores during training indicated that
neither feature increased SE scores and, on the contrary, seemed
to decrease SE performance after the first instance. These
findings suggest that too many metacognitive prompts can be
detrimental, particularly in a system that provides metacognitive
strategy training.


Keywords 


intelligent tutoring systems; metacognition; educational games; 
system interaction logs 


1. INTRODUCTION 


Intelligent tutoring systems (ITSs) provide an opportunity for
extended training and individualized feedback to support the
development of skills and strategies. One such ITS, Interactive
Strategy Training for Active Reading and Thinking (iSTART),
uses self-explanation (SE) training as a means of increasing
students’ comprehension of complex texts [4]. iSTART provides
instruction on SE strategies through lesson videos, guided
demonstration, and practice. Research indicates that prompting
metacognition, or reflection on one’s own knowledge, can
enhance the benefits of training within computer-based learning
[1]. In this study, we expand upon previous research to
investigate how two metacognitive features affect SE scores
during iSTART practice.


In iSTART’s generative practice, students write their own SEs
and a natural language processing (NLP) algorithm immediately
provides a score of poor (0), fair (1), good (2), or great (3). The
two metacognitive features were implemented within this
generative practice. The first feature is a performance
notification that alerts students that their SE score is below 2.0
and sends them to Coached Practice for remediation. The second
feature is a self-rating that prompts students to rate the quality of
their SE before receiving the computer-generated score. The
performance notification encourages metacognition indirectly,
whereas the self-rating is a direct metacognitive prompt [6]. The
current study expands on data reported in [3], which further
demonstrated the positive effects of iSTART on deep
comprehension, but also indicated that neither metacognitive
feature affected post-training learning outcomes. In this study,
we explore the log-data to investigate how these two
metacognitive features, both individually and in combination,
affect SE scores during iSTART generative practice.


Based on previous work [6], we predicted that the performance 
notification would increase SE scores immediately after the first 
instance of the notification. In [6], however, the instruction was 
brief, and did not allow examining further instances of the 
notification. In this study, we examine the effects of the 
notification after the initial instance during a longer duration 
study. Consistent with previous research [5], we had predicted 
that self-ratings would improve performance. Of particular 
interest was the interaction of the two features. One hypothesis 
is that there would be an additive effect such that having both 
features would yield the greatest SE score improvement [2]. An 
alternative hypothesis is that the redundancy of the two features 
would result in an interactive, and possibly negative effect [4]. 


2. METHODS


2.1 Participants 

As part of the larger study reported in [3], 116 high school 
students (Mage=17.67, SD=1.30) received monetary 
compensation for their participation. 


2.2 Design and procedure 

The study employed a 2 (performance notification: off, on) x
2 (self-rating: off, on) between-subjects design. Participants
completed iSTART training in three 2-hour sessions.
Participants first watched iSTART video lessons that provide
instruction on the purpose of SE training and five
comprehension strategies (comprehension monitoring,
paraphrasing, prediction, elaboration, and bridging). Next,
participants completed one round of Coached Practice, in which
a pedagogical agent provides individualized feedback on
students’ self-explanations. Participants were then allowed to
move freely throughout the system to interact with videos,
Coached Practice, identification games, and generative games
for the remainder of the training sessions. The metacognitive
features were implemented only during generative games.
Performance notifications were triggered each time the average
SE score was less than 2.0, and self-rating prompts were
triggered on randomly determined self-explanations
approximately 1/3 of the time.


3. RESULTS


We calculated a gain score to compare the average SE score in
the game before and the game immediately following the point at
which the average generative game score fell below 2.0, that is,
the point at which the performance notification was triggered (or
would have been triggered in the notification-off conditions). We
used log-data to identify participants who completed at least one
game in which their average SE score was less than 2.0 (n=78).
Though the performance notification could be triggered as many
times as necessary, most participants had no more than two
instances of average SE scores below 2.0 (Fig. 1). As participants
were able to move freely through the system, only 48 participants
(across all conditions) followed the generative game,
notification, generative game sequence needed to calculate a
gain score. These participants were relatively evenly distributed
across the conditions. We analyzed the first two instances of
average SE scores less than 2.0 for these 48 participants.


Figure 1. Frequency of Games with Average SE Scores < 2.0


For the first instance of notification, the average gain scores in
all conditions were positive. Though the pattern of gain scores
for the performance notification is consistent with previous
findings [3], an ANOVA indicated no effect of notification, of
self-rating, and no interaction, all F(1, 47) < 2.00 (Fig. 2, left).


Figure 2. Gain score in 1st and 2nd instance of avg. SE score <
2.0 as a function of performance notification and self-rating
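One possible reading of the gain-score computation, sketched from a chronological log of activities: whenever a generative game with an average SE score below the threshold is immediately followed by another generative game, the gain is the later score minus the earlier one. The log layout, the activity labels, and this exact definition of "gain" are assumptions; the source does not spell them out.

```python
def gain_scores(activity_log, threshold=2.0, max_instances=2):
    """Gain after each low-scoring generative game, up to max_instances.

    activity_log -- chronological list of (activity, avg_se_score) pairs;
                    the score may be None for non-generative activities
    """
    gains = []
    for (act, score), (next_act, next_score) in zip(activity_log, activity_log[1:]):
        if act == "generative" and score < threshold and next_act == "generative":
            gains.append(next_score - score)
            if len(gains) == max_instances:
                break
    return gains


# Hypothetical participant log.
log = [
    ("generative", 1.5),
    ("generative", 2.5),
    ("coached", None),
    ("generative", 1.0),
    ("generative", 0.5),
]
gains = gain_scores(log)
```

Participants whose low-scoring game was not directly followed by another generative game contribute no gain score, which matches the reduced sample described in the text.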


Fewer participants (n=27) had a second instance of notification. 
Contrary to the scores following the first instance, in this second 
instance, average gain scores were either near zero or negative, 
indicating that the scores after notification were the same or 
lower than before the notification. An ANOVA revealed no 
main effect of performance notification or self-rating, F's < 1.00, 
ns. There was a significant notification by self-rating interaction
indicating that having neither feature or both features did not
affect SE score, but that the presence of only one metacognitive
feature was detrimental to SE score, F(1, 26)=5.46, p < .05,
ηp² = .17 (Fig. 2, right).


4. CONCLUSIONS 


These findings indicate that neither metacognitive feature had a 
consistent effect on SE quality during iSTART training. Though 
there was an overall increase in SE score in the first instance (as 
indicated by positive gain scores), there was no significant effect 
of either performance notification or self-rating compared to 
control. In the second instance, the interaction should be 
interpreted with caution given the small sample size. 
Nonetheless, the features did not improve SE score, and were 
potentially detrimental to performance. One explanation for 
these findings is that iSTART intrinsically instructs on 
metacognitive strategies. Hence, the inclusion of additional 
metacognitive prompts may be redundant, if not overwhelming, 
at least after the first instance. 


These results were not consistent with extant research, and may 
be particular to iSTART. Certainly further analyses and studies 
are merited and will be explored. Nonetheless, given that neither 
prompt showed post-training learning outcomes [3] or sustained 
training benefits, we do not intend to include these features in 
future implementations of iSTART, and we would caution other
researchers to consider the possibility of potential metacognitive 
prompt over-dosages. 
ABSTRACT


We conducted a pilot study that used kernel-level packet 
capture to record the web pages visited by college students 
and the reading difficulty of those pages. Our results indi- 
cate that i) no students were fully compliant in their partic- 
ipation, ii) the number of texts encountered by participants 
was highly skewed, iii) the reading difficulty of texts was 
about 7th grade, M = 7.24, CI95[7.04, 7.43], though difficulty
varied by participant, and iv) the increasing use of
encryption is likely a limiting factor for using kernel-level 
packet capture to measure online reading in the future. 


Keywords 


reading, Internet, measurement, text difficulty 


1. INTRODUCTION 


A recent survey revealed that approximately 90% of under- 
graduate respondents used laptops for their electronic course 
readings even though 68% did not prefer electronic text- 
books to print [3]. The increase in online reading behavior 
has created new opportunities for researchers to track eco- 
logically valid reading behavior. Online reading reflects true 
interests and goals (unlike artificial experimental paradigms) 
and further allows measures of the time spent reading and 
of the text itself over extended periods of time. 


To better understand the online reading behavior of college 
freshmen, we conducted a pilot study using custom-designed 
online reading tracking software based on kernel-level packet 
capture. Tracking naturalistic online reading behavior appears
to be novel to the literature, as most studies of online
reading behavior either use lab-based methods like eye-tracking
or self-report methods like surveys. Our main research
objectives were to determine whether i) participants
would comply with the tracking, ii) the reading behavior 
of participants was measured consistently, and iii) the text 
difficulty of measured texts was in a reasonable range. 


2. METHOD 


2.1 Participants 

Participants (N = 7) were recruited through the psychology 
subject pool at an urban university in the southern United 
States. Self-reported ACT scores (M = 21.29, SD = 3.64)
ranged from 18 to 29. Participants were required to own 
and bring a laptop to the study when they enrolled. 


2.2 Materials 


Kernel-level packet capture software for tracking online reading
behavior was developed in C# using the WinPcap and
PcapDotNet packet capture libraries. The resulting soft- 
ware, called SNARF, runs as a Microsoft Windows service in 
the background whenever the computer is turned on. SNARF 
monitored all http packet traffic on all network devices and 
sent anonymized timestamped records of web page URLs 
to an online Google Fusion Tables service for collection. 
Records were anonymized by using the media access con- 
trol (MAC) address of the participant’s network card as an 
identifier. To minimize data traffic, SNARF sent only URLs 
that did not match a blacklist of known non-reading-related 
URLs, such as Windows Update and image/audio/video file- 
types. Also excluded from collection was any service using 
the encrypted https protocol. Encrypted traffic was ex- 
cluded for two reasons. First, it is highly likely that en- 
crypted traffic is of a personal nature that the participants 
would prefer not to share, e.g. email, banking, or health in- 
formation. Secondly, breaking encryption could potentially 
introduce security vulnerabilities and put participants at sig- 
nificant risk. 
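
The filtering rules described above can be illustrated with a short Python sketch. SNARF itself was written in C#, so this is not the authors' implementation; the blacklist entries, media extensions, record layout, and function names are hypothetical placeholders chosen only to show the logic (record unencrypted page URLs, skip blacklisted hosts and media files, and key each record by the MAC address).

```python
from urllib.parse import urlparse

# Hypothetical entries; the actual SNARF blacklist is not given in the paper.
BLACKLISTED_HOSTS = ("windowsupdate.com",)
MEDIA_EXTENSIONS = (".jpg", ".png", ".gif", ".mp3", ".mp4")

def should_record(url):
    """Return True if a URL passes the filters described in the text."""
    parts = urlparse(url)
    if parts.scheme != "http":  # encrypted https traffic is never recorded
        return False
    host = (parts.hostname or "").lower()
    if any(host == h or host.endswith("." + h) for h in BLACKLISTED_HOSTS):
        return False
    if parts.path.lower().endswith(MEDIA_EXTENSIONS):
        return False
    return True

def make_record(mac_address, timestamp, url):
    """Build one timestamped record keyed by the network card's MAC address."""
    return {"id": mac_address, "time": timestamp, "url": url}

print(should_record("http://example.com/article.html"))
print(should_record("https://example.com/article.html"))
```

Filtering on the client before sending keeps data traffic low, which is the stated motivation for the blacklist.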


2.3 Procedure 

Approval for the research protocol was obtained from our 
institutional review board. Participants were enrolled in 
the study in the fall of 2015. After consent was obtained, 
Table 1: Participant reading behavior


                   Flesch-Kincaid Grade Level     Word Count
                                  95% CI                              95% CI
Id  Texts  Days    M (SD)         LL     UL       M (SD)              LL       UL

1       1    0-    -              -      -        -                   -        -
2      23    4-    9.30 (8.05)    6.01   12.59    1137.10 (1985.10)   325.83   1948.30
3     170  100+    6.98 (5.74)    6.12    7.85     509.72 (1578.30)   272.46    746.97
4     210  101+    9.20 (6.67)    8.30   10.11    1152.50 (2086.00)   870.37   1434.60
5     829   94+    7.15 (5.57)    6.77    7.53     963.39 (1778.20)   842.34   1084.40
6       4   50+    7.28 (7.13)    0.29   14.26      14.00 (8.98)        5.20     22.80
7    3116  119+    7.10 (6.76)    6.86    7.34     417.77 (1236.40)   374.36    461.18


Note: CI = confidence interval; LL = lower limit; UL = upper limit; -/+ indicates under/over study length.


an experimenter installed the SNARF online reading behav- 
ior tracker onto the participant’s laptop and confirmed that 
SNARF was logging data to the Google Fusion Table service. 
At the end of the study, each recorded URL was queried 
and, if it was accessible, downloaded. Text from downloaded 
files was extracted using the Apache Tika library, tokenized 
into sentences using the Stanford CoreNLP tools [2], and 
then measured for word count and text difficulty using the 
Flesch-Kincaid Grade Level metric [1]. 


3. RESULTS & DISCUSSION 

Of the 327,179 timestamped URLs collected, only 87,029 
were unique, and of those unique URLs, only 26,762 (31%) 
were downloadable at the end of the study. Inspection of the 
timestamped URLs revealed that, despite efforts to black- 
list non-reading-related web traffic, many URLs were not 
reading-related, e.g. antivirus updates, ads, and video web- 
sites. 


Texts from downloadable URLs had extreme Flesch-Kincaid 
Grade Level (FKGL) values ranging from -3.40 to 7431, and 
extreme word count values ranging from 0 to approximately 
10 million. Inspection of the data revealed that the FKGL 
frequency distribution dropped precipitously at grade level 
20 and that the word count frequency distribution likewise 
dropped at 10,000 words. These values would be possible if a 
participant read a document with an average sentence length 
of 22 and average syllables per word of 2.3 (FKGL) or a 20- 
page single spaced paper (word count); thus these values are 
plausible but may be overly generous. Descriptive statistics 
for the texts and downloadable URLs after applying these 
filtering criteria are shown in Table 1. 
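
The plausibility check above can be reproduced from the standard Flesch-Kincaid Grade Level formula [1]. The sketch below is an illustration in Python rather than the pipeline used in the study; the function names are assumptions, and the text does not state whether the two cutoffs were applied inclusively, so the comparison operators are a guess.

```python
def fkgl(words_per_sentence, syllables_per_word):
    """Flesch-Kincaid Grade Level from average sentence length and syllables per word."""
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

MAX_GRADE = 20       # FKGL cutoff described in the text
MAX_WORDS = 10_000   # word count cutoff described in the text

def keep_text(grade, word_count):
    """Apply both filtering criteria (inclusive bounds are an assumption)."""
    return grade <= MAX_GRADE and word_count <= MAX_WORDS

# Average sentence length 22 and 2.3 syllables per word, as in the text.
print(round(fkgl(22, 2.3), 2))
```

With these inputs the formula gives roughly 20.13, which is why an average sentence length of 22 words and 2.3 syllables per word corresponds to the grade level 20 cutoff.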


Table 1 presents evidence addressing our research objec- 
tives. First, participants did not comply with tracking: two 
participants uninstalled the software within a week (one 
within the same day) and the remaining five participants 
failed to uninstall the software or meet the experimenter to 
uninstall the software after being reminded by email. Secondly,
participants’ online reading behavior was not measured
evenly: the number of texts (as measured by downloadable
URLs) read by participants was highly skewed,
ranging from 1 to over 3,000. This skewed distribution could 
be caused by some participants mostly using encrypted sites 
like Wikipedia or the New York Times which, by virtue 
of being encrypted, SNARF would not record. Finally, the 
reading difficulty of texts was in a reasonable range, generally
7th grade, M = 7.24, CI95[7.04, 7.43], and word count
on average was comparable to a page of single spaced text,
M = 564, CI95[521, 607], though both varied somewhat by
participant as shown in Table 1. These results are slightly
lower than might be expected when reading for academic 
purposes, but for general reading seem reasonable. 
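
The summary statistics above take the form of a mean with a 95% confidence interval. How the intervals were computed is not stated in the paper; the following Python sketch assumes a normal approximation (critical value 1.96), and the example values are hypothetical.

```python
import math
import statistics

def mean_ci95(values):
    """Mean and normal-approximation 95% confidence interval."""
    if len(values) < 2:
        raise ValueError("need at least two values to estimate a standard error")
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    half_width = 1.96 * standard_error
    return mean, mean - half_width, mean + half_width

grades = [6.5, 7.0, 7.4, 8.1, 7.2]  # hypothetical FKGL values
mean, lower, upper = mean_ci95(grades)
print(mean, lower, upper)
```

An interval built this way is symmetric about the mean, so the mean always lies midway between the lower and upper limits.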


4. CONCLUSIONS 


Our results indicate that kernel-level packet capture is a vi- 
able means for measuring online reading behavior save for 
the increasingly prevalent use of encryption on all web sites. 
While it would be possible to modify a browser to record 
the text displayed to the user, this alternative could inad- 
vertently collect email, banking, or health information that 
should remain private. Thus it may be that the balance be- 
tween privacy concerns and reading research is best struck 
by avoiding general purpose reading applications like web 
browsers and instead focusing on reading-specific applica- 
tions that are not otherwise used to access personal infor- 
mation. 
ABSTRACT


Many educators have been alarmed by the high dropout rates in
MOOCs. Various factors, such as lack of satisfaction or
attribution, may lead learners to drop out. Educational
interventions targeting such risks may help reduce dropout rates.
Designing an intervention first requires the ability to
predict dropouts accurately and early enough to deliver the
intervention in time. In this paper, we present a dropout predictor
that uses student activity features, and we then add features
describing learners’ study habits to improve its accuracy. Our
models achieved an average AUC (area under the receiver operating
characteristic curve) as high as 0.838 (0.795 without the study
habits features) when predicting one week in advance. The model
with learners’ study habits features attained an average increase
in AUC of 0.03, 0.06, 0.08 and 0.05 in the four cohorts (passive
collaborator, wiki contributor, forum contributor, and fully
collaborative).


Keywords 
MOOC, dropout prediction, study habits 


1. INTRODUCTION 


One way to address the high dropout rates in MOOCs is to predict
each learner’s dropout probability and deliver a timely
intervention. Some researchers have focused on extracting features
of learners’ study activities (such as resource access) from MOOC
logs and then building machine learning models. Balakrishnan [1]
used a discrete single-stream HMM to predict whether a student
would drop out. [2] tried to establish an extensible real-time
prediction model suitable for different courses. Loya [3]
demonstrated that learners who follow their learning process on
schedule have a greater probability of finishing a MOOC.
Liang J [4] used 3 months of data to predict a student’s dropout
state 10 days later with four typical machine learning models
(LR/SVM/GBDT/RF).


Taylor C. [5] used the dataset of 6.002x: Circuits and Electronics,
taught in Fall 2012 on edX, which includes course information and
students’ activity data. In addition to common simple features,
they produced some complex, multi-layered interpretive features
and used them as input to predictive models. They
divided the students into four groups according to their
participation: passive collaborators are learners who never
actively participated in either the forum or the wiki and only
viewed the resources without contributing; wiki contributors are
learners who generated wiki content but never posted in the forum;
forum contributors are learners who posted in the forum but never
actively participated in the wiki; fully collaborative learners
actively participated by both generating wiki content and posting
in the forum. Their results showed that when the sample size of a
student group is small (especially for wiki contributors, forum
contributors and fully collaborative learners), prediction
accuracy is relatively low.


In our work, we focus on extracting more important features, those
describing learners’ study habits, to improve the accuracy of the
predictive models, particularly for the small-sample groups.


2. PREDICTION PROBLEM DEFINITION 

Our data were obtained from the 2014 instance of the introductory
physics MOOC 8.MReV on the edX platform. We define the dropout
point as the time slice (week) at which a learner stops submitting
assignments, problems, or exams.
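
As an illustration of this definition, the following Python sketch derives a dropout week from the weeks in which a learner submitted something. It is a sketch under assumptions: the paper does not say whether the dropout point is the last active week or the first inactive one, so the choice of the first inactive week, the handling of learners with no submissions, and the function name are all hypothetical. The 16-week course length is taken from the prediction window mentioned in the text.

```python
COURSE_WEEKS = 16  # the text refers to predictions up to week 16

def dropout_week(submission_weeks, course_weeks=COURSE_WEEKS):
    """Return the first week with no further submissions, or None if the
    learner was still submitting in the final week.

    `submission_weeks` is an iterable of 1-based week numbers in which the
    learner submitted at least one assignment, problem, or exam.
    """
    weeks = set(submission_weeks)
    if not weeks:
        return 1  # never submitted: treated here as dropping out in week 1
    last_active = max(weeks)
    if last_active >= course_weeks:
        return None
    return last_active + 1

print(dropout_week([1, 2, 3, 5]))
```

A gap in activity (week 4 in the example) does not count as dropout under this reading, because the learner submits again later.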


The instructor could use the data from week 1 to the current week
i to make predictions. The model predicts whether an existing
learner drops out during week (i + 1) to week 16. For example, if
the current week is week 7, we use the logging data from week 1 to
week 7 to predict the learners’ performance at week 12 with lead
equal to 4 and lag equal to 7.


3. FEATURE ENGINEERING


Table 1. Self-proposed covariates


ID   Name                         Definition

x1   stopout                      Whether the student continues to submit problems
x2   total_duration               Total time spent on all resources
x4                                Number of wiki edits
x6                                Number of distinct problems attempted
x8                                Number of distinct correct problems
     average_number_submissions   Average number of submissions per problem (x7 / x6)
                                  Total time spent / number of distinct correct problems (x2 / x8)
                                  Number of problems attempted / number of correct problems (x6 / x8)
                                  Total number of collaborations (x3 + x4)
                                  Duration of longest observed event
                                  Total time spent on lecture resources
                                  Total time spent on book resources
     total_wiki_duration          Total time spent on wiki resources


We extracted 18 self-proposed features, 7 crowd-proposed
features (following Taylor’s work [5]) and 6 study habits
related behavioral features on a per-learner basis; these features
are listed in Table 1, Table 2 and Table 3. The features are then
assembled from different weeks as separate variables to build
predictive models.
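
Assembling features from different weeks as separate variables can be sketched as follows. This is a minimal Python illustration, not the authors' code: the input layout, the function name, and the `<feature>_w<week>` naming scheme are hypothetical.

```python
def assemble_features(weekly_features, lag):
    """Flatten per-week feature dicts for weeks 1..lag into one feature dict.

    `weekly_features` maps a 1-based week number to a dict of feature values
    for one learner. Each (feature, week) pair becomes its own variable,
    named "<feature>_w<week>" (hypothetical naming scheme).
    """
    row = {}
    for week in range(1, lag + 1):
        for name, value in sorted(weekly_features.get(week, {}).items()):
            row[f"{name}_w{week}"] = value
    return row

learner = {
    1: {"total_duration": 3.5},
    2: {"total_duration": 1.0},
    3: {"total_duration": 9.0},
}
print(assemble_features(learner, lag=2))
```

Only weeks up to the lag are included, so a model trained with lag 2 never sees week 3 data. Weeks with no logged activity produce no variables in this sketch; a real pipeline would need an explicit fill value.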


Table 2. Crowd-proposed covariates


x301  problem_finish_percent_pre_start24h     The number of problems the learner finished correctly in the first 24h after the problem was issued
x302  problem_finish_percent_pre_deadline24h  The number of problems the learner finished correctly in the last 24h before the problem was due
x303  time_first_visit                        Min(time_first_problem_get, time_first_html_etext_access) - project_issue_time

aot [eines is. ciesk _____Aveupe ofl prblen tine betwen problem fei chk and pote fot | 
305 [wad tie pivalt [Tal boo daon bir postion esl ial vido dersoatefre tion it | 
2 A TINeieeraies 


4. RESULTS 


As shown in Figure 1, for all learners, our models achieved an
average AUC as high as 0.838 (0.795 without the study habits
features) when predicting one week in advance.
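
For readers unfamiliar with the metric, AUC can be computed directly from its probabilistic definition. The Python sketch below is an illustration only (the labels and scores are made up, and a library routine would normally be used instead): it counts how often a dropout receives a higher predicted score than a non-dropout.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (label 1) gets a
    higher score than a randomly chosen negative (label 0), with ties
    counted as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical example: 1 = dropped out, scores = predicted dropout probability.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the reported values of 0.795 and 0.838.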


[Figure panels: "Logistic regression results for all learners" and "Logistic regression results for all learners with study behavior feature"; x-axis: the predicted week number]

Figure 1. Heatmap for the logistic regression dropout 
prediction problem 


The feature importance analysis in Figure 2 shows that the study
habits related behavioral features (x301-306) played the more
important roles in dropout prediction. The features with
the most predictive power include
problem_finish_percent_pre_deadline24h, study_before_submit,
and time_first_visit.


[Figure plot: "Feature importance of all lag lead"; axes: Feature, Feature importance (0 to 0.9)]


Figure 2. Feature importance 


With the new features related to study habits, the AUC of our
predictions improved (Figure 3), especially for the small-sample
groups (wiki contributor, forum contributor, and fully collaborative).


[Figure panels: "Logistic regression results for the forum contributor cohort" and "Logistic regression results for the forum contributor cohort with study behavior feature"]
Figure 3. Heatmap for the logistic regression dropout
prediction problem for three groups


In the future, we will try to apply the improved predictor each
week as a course progresses in order to deliver interventions in
small private online courses.
ABSTRACT


One of the issues that MOOCs have faced since their emergence
is low engagement and completion rates. As an open and free
educational resource, MOOCs are available to people around
the world with different motivations and prior knowledge.
It is a challenge to keep students engaged in a MOOC
environment. In the present study, we apply a polytomous
item response theory (IRT) model to explore the relationship
between students’ self-evaluation of their previous knowledge
and their engagement behaviors in a Geography MOOC.
Specifically, we estimate students’ latent trait, pre-knowledge,
from 15 Likert-scale items. Engagement behaviors include
assignment, peer review, forum, comment, quiz, and lecture.
Each of them is quantified by its aggregated frequency. We
then examine the correlation between pre-knowledge and each
type of engagement behavior. We find that self-evaluated
previous knowledge does not predict students’ engagement
behaviors for any type of engagement. However, the application
shows that traditional psychometric models used for
standardized tests may be useful and promising in the MOOC
context.


Keywords 
MOOC, engagement, pre-knowledge, Polytomous IRT 


1. INTRODUCTION 


A massive open online course (MOOC) is a model for delivering
learning content online to anyone who wants to take
a course, with no limit on attendance. MOOC engagement
describes students’ involvement in a MOOC. It usually
includes behaviors such as posting questions and comments
in the MOOC system, submitting assignments and quizzes,
and other behaviors that can directly predict students’
achievement. Although the number of MOOC students has
increased tremendously across the world during the past
decade, low completion and a low level of active
engagement remain a persistent problem for MOOC development
[1]. MOOC engagement is important for predicting students’
achievement and for showing whether students really learned
something from the course. Students’ prior knowledge in
computer science and problem solving, defined by performance
on the first two assignments, had an impact on their MOOC
performance [3]. In the current research, we use pre-course
survey data to define pre-knowledge of Geography and to
explore whether it can predict students’ MOOC engagement.
We also use a polytomous IRT model to examine each item
and its performance.


2. POLYTOMOUS IRT 


The polytomous IRT model is an important model in the IRT
family, designed for items with more than two possible
response options. There are four main types of polytomous
IRT models: the partial credit model, the rating scale
model, the generalized partial credit model, and the graded
response model. One example application of the graded
response model is attitude survey data. Items in an attitude
survey usually use a Likert-scale format. For example, for
the question “How much do you think you like this opera?”,
the options can form a 5-point Likert scale from “I like it
very much” to “I don’t like it at all”. The mathematical
equation for the polytomous IRT model is the following:


\[
P^{*}_{ij}(\theta) = P(X_j \ge i \mid \theta) = \frac{e^{D a_j (\theta - b_{ij})}}{1 + e^{D a_j (\theta - b_{ij})}}
\]


In the above equation, D equals 1.7. For each item j, a_j
is a discrimination parameter, and b_ij is the difficulty
parameter for each option i in item j (b_1 < b_2 < ... < b_n) [2].
Figure 1 shows a graded response function of a polytomous
item. Take the blue line as an example: people with a
higher theta level seldom choose this option, since the slope
is roughly negative.
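
To make the graded response function concrete, the Python sketch below computes category probabilities for a single item. It follows the standard graded response model, in which each category probability is the difference between two adjacent logistic boundary curves; the parameter values and function names are hypothetical and are not estimates from this study.

```python
import math

D = 1.7  # scaling constant given in the text

def boundary_prob(theta, a, b):
    """P(X >= category | theta) for one threshold b (logistic boundary curve)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def category_probs(theta, a, thresholds):
    """Probability of each response category for one item.

    `thresholds` are the ordered difficulty parameters b_1 < ... < b_n,
    which define n + 1 response categories (0..n).
    """
    cumulative = [1.0] + [boundary_prob(theta, a, b) for b in thresholds] + [0.0]
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical item with discrimination 1.2 and four thresholds (five categories).
probs = category_probs(theta=0.5, a=1.2, thresholds=[-1.5, -0.5, 0.5, 1.5])
print(probs)
```

The probability of the lowest category falls steadily as theta increases, which is the pattern described for the curve with a negative slope, while the probability of the highest category rises.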


3. METHOD 


3.1 Data 


Data comes from a MOOC in Geography. It has enrolled 
over 100,000 students from 200 countries to date. Data from 
its 2014 class was used in the present study. In total, after 
excluding students with little data, there were 3058 students 
in the current analysis. 


3.2 Measure 


[Figure plot, item A10: curves for P(X=0|Theta), P(X=1|Theta), P(X=2|Theta), P(X=3|Theta), and P(X=4|Theta); x-axis: Theta]


Figure 1: Graded response function 


Table 1: Factor loading for each item

Item            1      2      3      4      5
Factor Loading  0.630  0.427  0.608  0.782  0.522


There are 15 seven-point Likert-scale items, ranging from
“strongly agree” to “strongly disagree”, designed for students
to evaluate their pre-knowledge of Geography. One example is
“I enjoy reading maps.” For students’ engagement behavior,
there are six criteria: assignment, peer review, forum,
comment, quiz, and lecture. Each is quantified by aggregating
the number of times a student participates in that type of
behavior.


3.3 Procedure

The graded response model was fitted using the mirt package
in R to estimate students’ pre-knowledge of Geography. Then
the Pearson correlation coefficients between pre-knowledge
and each type of engagement behavior were calculated
to examine whether students’ pre-knowledge influences
their engagement behaviors in the MOOC environment.
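
The correlation step can be illustrated with a small Python function. The study performed its analysis in R; this sketch only shows how a Pearson coefficient is computed, and the example numbers are hypothetical trait estimates and behavior counts.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: estimated pre-knowledge and number of forum posts.
pre_knowledge = [1, 2, 3, 4]
forum_posts = [1, 3, 2, 4]
print(pearson_r(pre_knowledge, forum_posts))
```

In the study, one such coefficient is computed for each of the six engagement behaviors against the estimated pre-knowledge trait.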


4. RESULTS 

The model fit indices indicate a good model fit (RMSEA=0.047,
RMSEA_5=0.041, RMSEA_95=0.053, CFI=0.959). The factor
loading estimates show that these 15 items can be used
to measure the latent trait, pre-knowledge of Geography
(Table 1). The parameter estimates are presented in Table 2,
and the graded response function for each item is shown
in Figure 2. Additionally, Table 3 presents the
correlation coefficients between pre-knowledge of Geography
and each type of engagement behavior.


5. CONCLUSIONS 


Table 2: Parameter estimation for each item. 
Figure 2: Graded response function for each item.


Table 3: The Pearson correlation coefficient between
pre-knowledge and Engagement Behavior Type (EBT)

Pre-knowledge of Geography   -0.018   -0.022


All 15 items load relatively well on one factor, so it is
reasonable to use a one-dimensional IRT model. Also, the
fit indices show that this graded response model fits the
data well. In terms of discrimination, items 8, 9, and 15
discriminate very well, which indicates that these three
items provide more information about students’ pre-knowledge
of Geography than the other items. In terms of the difficulty
parameter, b4 cannot be estimated for items 1, 2, and 14.
This indicates that these items might be problematic.


All of the correlation coefficients are negative and
nonsignificant (p > .05). This result indicates that although
the general trend is that students with less pre-knowledge of
Geography engage less frequently, none of the correlations
is statistically significant. In other words, whether
students report relatively rich or poor pre-knowledge of
Geography does not predict their engagement behaviors. One
explanation may be that pre-knowledge here is measured
by self-evaluation, which depends on students’ metacognitive
ability. This subjective report differs from objective
questions, such as “Have you taken any university-level
courses related to this MOOC?” In further research, a more
direct measure of pre-knowledge is needed.
ABSTRACT


When learners become frustrated or confused, they can ask for
help by posing questions in MOOC forums. Students’ questions
reveal their needs and learning problems. If their questions are
not answered promptly and effectively, students may drop out. In
the present study, students’ questions from one Chinese MOOC
forum were collected and classified. Results showed that most of
the posts in the forum were questions and that the quantity of
questions decreased over time, although in some weeks the number
of questions increased. Different types of questions have their
own patterns of variation, which means that instructors need to
focus on certain types of questions in the corresponding period.


Keywords 


Student questions, MOOCs forum, classification, time-variation. 


1. INTRODUCTION 


Educators think highly of students’ question asking. Questions
posed by students can reflect active learning, knowledge
construction, curiosity and the depth of the learning process [1].
Through analysis of these questions, instructors can better
understand a student's thinking and so make more targeted
teaching decisions [2]. In addition, students’ question asking is
associated with their achievement: learners who perform well
ask questions more frequently or of higher quality [3][4]. Thus,
teachers can also assess students’ learning based on their
questions.


Researchers have investigated students’ questioning behavior in a
variety of educational settings, such as classrooms, tutoring, and
online learning environments [1]. MOOCs allow students to pose
their questions in a forum format and then wait for their questions
to be answered by instructors and peer students. This online
learning mode and asynchronous discussion pattern influence
students’ questioning behavior. Students may pose different kinds
of questions at any time and at any place anonymously. The present
study investigated students’ questioning behaviors in a MOOC
forum, including the quantity, classification and variation over
time. Based on previous research and forum data, we first
established standards to screen question posts, then classified
and counted them, and finally observed their variation over the
entire course.
2. DATA AND ANALYSIS
2.1 Platform and Data 


We analyzed a forum of the course The Introduction to
Psychology on the Chinese MOOC platform XuetangX, which
was launched in October 2013. This course has run for
several sessions and has a large enrollment of tens of thousands
of learners. We chose the data for the 2015 Spring Session, which
ran from March 4th to September 15th, as it had the largest
number of posts in the forum. The whole course had 12 weeks of lectures and
two exams. The mid-term test took place between the 10th week 
and the 12th week. The final exam period ran from the 15th to 
16th week. All the data came from www.kddcup2015.com and 
www.xuetangx.com. 


2.2 Question Selection and Classification 

First, we selected question posts from all the data. We regarded
the question mark in a sentence as a marker feature. Some
modal words and question words were also taken into
consideration, such as the Chinese words for “whether or not”,
“what”, “how”, and “why”. There are also some fixed expressions
of questions, such as the Chinese phrases for “I do not know” and
“I am confused” [4]. Two researchers labeled the posts
separately, then compared their labels and resolved the
differences. The inter-rater agreement was 86% (representing
agreement on 880 items out of 1029 opportunities for agreement,
multiplied by 100).


After filtering posts, a taxonomy of the questions was created
based on Brinton’s [5] classification of MOOC discussion
threads and on the question posts in the forum. It includes five
categories: (1) Course management questions, relating to course
design, time arrangement, learning resources, etc.; (2) Course
content questions, involving learners’ understanding of the
learning materials or exercises; (3) Interaction questions, where
learners ask about and exchange experiences, learning methods and
emotions; (4) Platform operation questions, about problems
students encounter when operating the platform; (5) Other,
including vague expressions and irrelevant
questions. Two researchers classified the question posts separately 
and then reached an agreement. The inter-rater agreement was 
82% (representing agreement on 613 items out of 751 
opportunities for agreement multiplied by 100). 
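
The agreement figures reported above are simple percent agreement, which can be checked with a few lines of Python. The function name and the example labels are hypothetical; the counts 880/1029 and 613/751 come from the text.

```python
def percent_agreement(labels_a, labels_b):
    """Percentage of items on which two raters assigned the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    agreements = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return 100.0 * agreements / len(labels_a)

# Hypothetical labels from two raters ("q" = question post, "n" = not a question).
rater_1 = ["q", "q", "n", "q"]
rater_2 = ["q", "n", "n", "q"]
print(percent_agreement(rater_1, rater_2))

# Reported counts: screening step and classification step.
print(round(100 * 880 / 1029), round(100 * 613 / 751))
```

Percent agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's kappa would require the full label lists rather than only the agreement counts.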


We calculated the total amount of students’ question posts, the 
distribution of different classifications and different types of 
question variation over the weeks of the course. 


3. RESULTS 

3.1 The Quantity of Students’ Question 
Posing 

In the forum, 1002 people participated in the discussion,
accounting for only 3 per cent of all registered learners. Among
them, 569 students posted 1029 posts, which received 3165 replies,
so the average number of replies per post is 3.1. Two researchers
identified 751 question posts, accounting for about 73% of the
total posts,




indicating that learners’ main activity in the MOOCs forum was 
question asking and answering. Figure 1 shows the quantity of 
students’ questions over the course weeks. The number of posts 
decreased in general with a few fluctuations. 
Figure 1: The quantity of students’ questions over
course weeks


3.2 The Distribution of Five Categories 

Table 1 shows the number and proportion of questions in each of
the five categories, as well as the number of replies and the
average replies per question in each category. Course management
questions are the most frequent, while course content questions
rank second. This may be due to instructors’ low participation in
the forum. Over the whole course, only a few community assistants
and administrators posted a limited number of posts and answers.
However, course management questions and platform operation
questions mainly rely on instructors’ answers, whereas course
content questions and interaction questions can be answered by
both instructors and peer learners. Without prompt and proper
replies, the first and fourth kinds of questions are asked
repeatedly, so their average number of replies is lower than that
of course content questions and interaction questions.


Table 1. The quantity of questions and their replies 


Question types | Quantity | Proportion of the total questions | Replies | Average reply per question

Course management
3.3 The Time-variation of Three Categories

As only a very small number of questions belong to the third and
fifth categories, we removed them from further analysis and
calculated the quantity of the other three categories by course
week. Figure 2 shows the relationship between course weeks and
question quantity, suggesting a decreasing trend for all types
of questions. However, each type also has its specific
characteristics. Course management questions persisted throughout
the course, because learners raise questions about the
textbook, exams, and certificates from start to end. At some
points, these questions increased significantly. In contrast,
course content questions disappeared after the lectures were over
and mainly emerged in certain chapters. As for platform operation
questions, their proportion is lower, although students may
encounter more problems with practice submission in some weeks.
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Figure 2: Question quantity of three categories in every 
course week 

To summarize, through the analysis of students’ questions in the
forum, we can learn the patterns of their questioning behavior and
in turn improve instruction in MOOCs. Instructors need to focus
on certain kinds of questions during different periods and provide
appropriate guidance and answers. Course management questions
and platform operation questions influence learners’ learning
progress, so instructors should clearly describe the details of course
arrangements to avoid misunderstanding and confusion. When
platform errors occur, they need to solve the problem as quickly
as possible or give suggestions to learners. As for course
content questions, even without instructors’ replies, learners and
peers will try to discuss and find answers by themselves, so the
main task of instructors is to guide their discussion and give
answers at the proper time.


The current study is part of a larger project studying the long-term
impact of question asking/answering in MOOCs. We expect a
significant relation between students’ completion rates and their
question asking/answering behaviors. Further study will be
reported in the future.


ABSTRACT


We present an active learning system for coding exercises
in Massive Open Online Courses (MOOCs) based on real-time
feedback. Our system enables efficient collection of
personalized feedback via an instructor tool for automated 
discovery and classification of bugs. 


1. INTRODUCTION


Active learning is a learning approach that “requires stu- 
dents to do meaningful learning activities” in contrast to tra- 
ditional lecture-based approaches where “students passively 
receive information from the instructor” [2]. In active learn- 
ing, timely feedback is important as it helps learning and 
reduces the risk of learner disengagement due to repeated 
failure to complete learning activities. 


MOOCs have leveraged in-video quizzes as an active learning
strategy, but these quizzes have traditionally been limited
to multiple choice questions. One reason that introducing
higher order tasks, such as coding exercises, has been
challenging is that it is difficult to provide good feedback.
Most automated code grading systems allow for efficient 
grading through unit testing, but these methods are often 
limited in the forms of feedback they can provide. 


Feedback that helps learners understand their errors can im- 
prove learning outcomes. Stamper et al. [5] demonstrated 
significant problem completion rate improvements in a logic 
course when feedback was available to learners. This has
motivated related developments in data-driven methods to 
generate such feedback [8, 4, 1]. 


In this demo, we will show a system that enables instruc- 
tors to efficiently generate and provide real-time feedback 
for programming exercises in MOOCs through extensions 
to Executable Code Blocks (ECBs) [6] and the Codewebs 
engine [1]; these exercises can be embedded throughout the 
learning experience to enable rich active learning. 


2. EXECUTABLE CODE BLOCKS 
Executable code blocks (ECBs) [6] enable learners to write 
and execute code directly in their web browser. The primary 
advantage of ECBs is that they can be tightly integrated 
into the course experience. For example, immediately after 
a concept is explained in a video, a learner can be asked to 
implement the specific concept in an ECB. 


ECBs usually employ unit testing strategies to evaluate if a 
learner’s implementation is correct. We extend ECBs such 
that when a learner makes an incorrect submission, they can 
request additional feedback that highlights potential errors 
in their submission and provides hints that guide the learner 
towards correcting these errors (see figure 1). These hints 
are provided efficiently by an instructor through an exten- 
sion of the Codewebs engine. 


[Screenshot of an ECB. Prompt: “Write a function in Octave that estimates θ using regularized linear regression via the Normal Equations.” Below the learner’s code, a hint notes that X already includes an intercept component, so an intercept term does not need to be added explicitly, followed by the message “Sorry your submission did not pass all the test cases.”]


Figure 1: Hints provided in an ECB for an incorrect 
submission. 


3. CODEWEBS ENGINE 


We use the Codewebs engine [1] to localize errors in learner 
code submissions and identify common classes of errors. We 
describe here the relevant process of doing so automatically 
at a high level, and refer the reader to [1] for details. 


The Codewebs engine operates on the abstract syntax tree
(AST) representation of code submissions. Let $n$ be a node
in the AST, $T_n$ be the subtree rooted at $n$, and $P_n$ be the
subtree rooted at the parent of $n$. The local context of $T_n$,
denoted by $T_n^c$, is $P_n$ with $T_n$ removed (see figure 2).


We say that $T_n^c$ is a buggy context if submissions containing
$T_n^c$ are more likely to be incorrect than by random chance.
The Codewebs engine declares that $P_n$ is a bug if $T_n^c$ is a
buggy context but no subtree of $T_n$ has a buggy context.
Given a bug $P_n$, the Codewebs engine then searches for a
correction $C'$ such that replacing $P_n$ with $C'$ results in a
correct program.
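
The notions of local context and buggy context can be illustrated with a small sketch. This is not the Codewebs implementation (see [1] for that): it assumes ASTs are nested tuples of the form `(label, child, ...)`, it flags a context simply when its incorrect rate exceeds the overall incorrect rate (no statistical test), and it omits the "no buggy subtree" condition and the correction search. The names `local_contexts`, `buggy_contexts`, and `min_support` are hypothetical.

```python
from collections import defaultdict

REMOVED = ("<REMOVED>",)  # placeholder that stands in for the removed subtree


def local_contexts(tree):
    """Yield (context, subtree) for every non-root node of `tree`.

    A tree is a nested tuple (label, child, child, ...). The context is the
    parent's subtree with the node's own subtree swapped for REMOVED.
    """
    label, *children = tree
    for i, child in enumerate(children):
        context = (label, *children[:i], REMOVED, *children[i + 1:])
        yield context, child
        yield from local_contexts(child)


def buggy_contexts(submissions, min_support=2):
    """Return contexts whose incorrect rate exceeds the overall rate.

    `submissions` is a non-empty list of (tree, is_correct) pairs.
    """
    seen = defaultdict(lambda: [0, 0])  # context -> [incorrect, total]
    for tree, is_correct in submissions:
        # Count each context at most once per submission.
        for context in {c for c, _ in local_contexts(tree)}:
            seen[context][1] += 1
            if not is_correct:
                seen[context][0] += 1
    base_rate = sum(not ok for _, ok in submissions) / len(submissions)
    return {
        context
        for context, (bad, total) in seen.items()
        if total >= min_support and bad / total > base_rate
    }


# Toy submissions: the incorrect one uses "+" where "-" is needed.
correct = ("*", ("X'",), ("-", ("*", ("X",), ("theta",)), ("y",)))
incorrect = ("*", ("X'",), ("+", ("*", ("X",), ("theta",)), ("y",)))
submissions = [(correct, True), (correct, True),
               (incorrect, False), (incorrect, False)]
buggy = buggy_contexts(submissions)
```

In this toy data only the contexts that contain the `+` node are flagged, while contexts shared by correct and incorrect submissions are not.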


We extend Codewebs in two ways. First, we modify the 
localization process to consider local contexts that are
semantically equivalent¹. This allows us to discover more bugs
across submissions that might have syntactically distinct but 
semantically equivalent contexts. We also use this to im- 
prove correction discovery in a similar way (see figure 3) 
and improve correction searching to handle instances where 
multiple bugs occur within a submission. 


Second, we introduce the concept of bug groups or error
modes. Two bugs B and B’ belong to the same group iff B


¹ We follow the definition of semantic equivalence used in [1].


[Figure 2 diagram: abstract syntax tree of an expression of the form X' * (...), in which the inner binary expression over X, theta, and y is replaced by a REMOVED placeholder to form the local context.]


Figure 2: Left: Subtree $P_n$ containing subtree $T_n$ in
pink. Right: $T_n^c$, the local context of subtree $T_n$.


Figure 5: Percentage of incorrect submissions by
number of error modes.


Furthermore, we can provide instructors with a tool (see
figure 4) to explore these common error modes. This tool
orders bug groups by the frequency at which they appear
in learner submissions. This enables instructors to quickly
understand the most common errors made by learners. This
breakdown is useful for course material improvement as they
can expose common learner misconceptions.


Figure 4: Instructor tool for exploring common errors
based on bug equivalences classes.


5. RESULTS

We introduced 3 ECBs into the Machine Learning MOOC
on Coursera involving tasks of varying levels of complexity
(e.g., implementing the cost function for regularized linear
regression). Each ECB required between 10 and 20 lines of
code each to solve.


For each ECB we collected between 3,118 and 5,550 submissions,
consisting of between around 1,000 and 3,000 distinct
ASTs (see table 1). These submissions were used to train
the Codewebs model. We find that a relatively small number
of error groups (40) is required to achieve good coverage
of a large fraction of incorrect submissions (see figure 5).
Between 28.6% and 55.0% of incorrect submissions contain
at least 1 of the 20 most common error modes, and between
36.7% and 61.0% contain at least 1 of the 40 most common
error modes (see figure 5).


A teaching assistant was recruited to label the top 40 discovered
error groups, and we are now running tests to understand
the effects of this intervention on learning outcomes.


It is also possible to show learners automatically generated
corrections when instructor input is not available.


6. REFERENCES
[1] A. Nguyen, C. Piech, J. Huang, and L. Guibas.
Codewebs: Scalable homework search for massive open
online programming courses. In Proceedings of the 23rd
International Conference on World Wide Web, WWW
’14, pages 491-502, New York, NY, USA, 2014. ACM.
[2] M. Prince. Does active learning work? a review of the
research. J. Engr. Education, pages 223-231, 2004.
[3] K. Rivers and K. R. Koedinger. Data-driven hint
generation in vast solution spaces: a self-improving
python programming tutor. International Journal of
Artificial Intelligence in Education, 27(1):37-64, 2017.
[4] R. Singh, S. Gulwani, and A. Solar-Lezama. Automated
feedback generation for introductory programming
assignments. In Proceedings of the 34th ACM
SIGPLAN Conference on Programming Language
Design and Implementation, PLDI ’13, pages 15-26,
New York, NY, USA, 2013. ACM.
[5] J. C. Stamper, M. Eagle, T. Barnes, and M. Croy.
Experimental Evaluation of Automatic Hint Generation
for a Logic Tutor, pages 345-352. Springer Berlin
Heidelberg, Berlin, Heidelberg, 2011.
[6] C. Wong. Active learning experiences with code
executable blocks.
https://building.coursera.org/blog/2016/09/30/active-learning-experiences-with-code-executable-blocks/.


SUMMARY


As EDM and AIED innovations proliferate, the ability for diverse 
products to consistently interpret each other’s data will emerge as 
a critical issue. Formal data interoperability standards that enable 
diverse datasets to be curated, accessed, merged/compared and 
fruitfully analyzed will play a crucial role in research and in the 
successful mass adoption of products based on that research, as 
will standards that enable systems to produce data that can be 
mined by existing and yet-to-be-invented algorithms. Yet this 
important topic is often neglected by researchers and system 
developers, who naturally focus on the specific problems they set 
out to solve and do not consider how they can either contribute or 
consume data produced by other systems or how their innovations 
will fit into larger ecosystems. This tutorial is intended to: 


• Raise awareness of the role of standards and their criticality
for EDM and AIED;


• Provide participants with an understanding of the nature,
status, and current activity of multiple international standards
development efforts relevant to educational data;


• Provide participants with insight into how they can
beneficially apply standards and, in some cases, contribute to
their development.


TOPICS 


This tutorial will cover the following topics:


• Why schools, corporations, and government agencies
require standards conformance in procurement: How 
standards interact with regulations and requirements to 
facilitate the free exchange of information and data, to 
prevent “lock-in” and thereby lower costs, to ensure quality 
and minimal levels of functionality, and to protect the 
integrity and privacy of data. 


• How standards shape product categories and markets:
How standards can define functionality, product capabilities, 
and market segmentation. In many instances, standards 
determine which of a number of competing approaches will 
dominate. They can shape markets and lead to winners and 
losers and long-term consequences for producers, consumers, 
and researchers alike. There are obvious examples in areas 
such as telecommunications and manufacturing, but there are 
also examples in educational technology relevant to EDM 
and AIED. 


• How standards can support research and lower market
entry barriers for innovative products: How standards
make it possible for innovative component technologies to be
independently developed without requiring a vertical
monopoly, and how they support research by making it
possible for data produced by one system to be understood
by another.


• Types of standards (governance, process, and data
interoperability): People often think of standards as relevant 
only to technical interoperability, e.g. to determining data 
formats, sizes, shapes, tolerances, and the like. But there are 
other types of standards as well, including process standards 
such as ISO 9001 and Software Engineering Standards and 
governance standards that address issues such as data 
preservation, curation, ethics, and privacy. All of these will 
play a critical role for EDM and AIED. 


• International standards organizations: A survey of
standards development organizations (SDOs). This segment 
will briefly explain the structure of international 
standardization, the principles by which ISO, IEC, IEEE, 
W3C, and similar SDOs abide (openness, consensus, 
balance, due process, right of appeal), the differences (and 
similarities) between these and industry consortia, and the 
SDOs that are most relevant to EDM and AIED. 


• How standards are made: The standards development
process has been refined over many years to ensure that each 
SDO can be productive within its principles and goals. This 
segment will describe how standards development works so 
that participants have an idea of what it entails and how to 
participate. 


• A brief history of standards related to educational and
training technology: Starting circa 1996, various 
organizations and consortia began developing standards, 
some better known and more widely adopted than others. We 
will briefly survey this history with a view towards extracting 
some key “lessons learned” that apply generally to standards 
development: The perfect is the enemy of the good; standards 
are a poor way to define systems but a great way to define 
how they interoperate; simplicity and modularity lead to
adoption; industry participation is vital; and how to avoid 
standards wars. 


• Current international standards activity relevant to EDM
and AIED: This is a major segment that will touch on a large 
number of relevant standards, including: 


o Metadata standards 
o Format standards (e.g. data shop) 
o Competency and learner information standards 


o Data reporting and curation standards
o Platform standards

o Big data and AI ethics

o Student data governance 

o Possibly needed additional standards 


Each standard will be summarized and described in terms of 
what problem(s) it solves, how it works, who developed it, 
who uses it, how it fits in with other standards, and what the 
presenters see as its future. 


• Tools for applying standards to EDM and AIED: This
segment will focus in on a few high-value standards and 
applications of standards to EDM and AIED. This segment is 
the punchline of the tutorial and will cover the standards that 
the presenters feel are most important. It will focus on 
existing or emerging technologies that participants can apply 
now or in the near future and will provide concrete examples 
of how standards are applied in software. 


o Using standards to report and collect data 


o Data set efforts (Datashop, Dataport)


o The US DoD’s Total Learning Architecture and 
related unification efforts 


• How to get involved in the standards development
process: This last, short segment will provide participants 
with information on how to get involved if they are 
interested, to be followed up offline. 


• Questions and Answers: Adequate time will be set aside to
address participants’ questions and issues. 


• Presenter Relevant Bios:


o  http://ransformingedu.com/speakers/avron-barr/
o  http://eduworks.net/robby/
o  http://www.xiangenhu.info/


ABSTRACT


Principal stratification (PS), which measures variation in a 
causal effect as a function of post-treatment variables, can 
have wide applicability in educational data mining. Under 
the PS framework, researchers can model the effect of an 
intelligent tutor as a function of log data, can account for 
attrition, and study causal mechanisms. Participants in this 
tutorial will learn how and when PS works and doesn’t work, 
and will learn three methods of estimating principal effects. 


1. PRINCIPAL STRATIFICATION IN EDM 
RESEARCH 


Educational data miners are increasingly interested in causal 
questions—what interventions work, for whom, and how. 
Accompanying this interest is the widespread realization 
that there is no such thing as “the effect”: actually, effects 
can vary widely between individuals. Estimating the differences
in effects between types of learners is (in principle)
straightforward for types defined prior to the onset of an 
experiment. But what about learners who use the software 
in different ways—or, even given the opportunity, don’t use 
it at all? Traditionally, “post-treatment” variables, observed 
subsequent to treatment assignment, are treated as media- 
tors whose analysis requires the kind of untestable assump- 
tions randomization is supposed to avoid. 


Principal stratification (PS) [2] offers a different approach: 
categorizing learners based on how they would (or would 
not) use the software if given the opportunity. Under the PS 
approach, an analyst begins by defining types, or “principal 
strata” of learners based on post-treatment measurements, 
then estimates the probability each learner is a member of 
each stratum (conditional on baseline covariates), and finally
estimates the average effect of the treatment within each stratum. In
a randomized experiment, the final step of the process pro- 
ceeds from the randomization (and, possibly, testable mod- 
eling assumptions). That is, researchers need not assume 
unconfoundedness, or that all relevant variables have been 


measured. The result is a principal effect, or separate esti- 
mate of an average treatment effect for each usage mode of 
interest; these may be used to explore causal mechanisms, 
study the conditions under which software might work bet- 
ter (or worse), learn dosage effects (i.e. does more usage 
translate to larger effects), and many other applications. 


1.1 EDM Questions PS may Help Answer 


PS could help address a wide range of research questions in 
EDM. Some examples are: 


• Does the effect of an intervention depend on learners’
(measured) emotional state?


• Are some sections of the software more effective than
others?


• Do some learner strategies, such as hint usage or mastery
learning, correspond to larger effects than others?


• Are there intermediate outcomes, such as mastery speed
or error rate, that can serve as good surrogates for a
final outcome, such as a post-test?


• What are the treatment effects after attrition?


Each of these questions concerns an average treatment effect
for a group of learners that is defined based on variables
measured only after the intervention began. This is
the type of question principal stratification was designed to 
answer. 


1.2 Estimating Principal Effects 

The catch is that principal effects can be difficult to esti- 
mate. Estimating effects within principal strata depends on 
knowing who is in which stratum—for instance, which stu- 
dents in the control condition would have been frustrated, 
had they been assigned to treatment, or which students 
would have attrited, had they been assigned to the opposite
condition—which is unobserved and must be inferred. The 
most popular and powerful approach begins by assuming a 
model (typically the normal distribution) for the outcome
within each stratum and a model for who is in which stratum
(typically logistic regression). Next, it fits a mixture
model for those subjects with unobserved stratum membership.
For instance, in an experiment comparing students assigned
to use an intelligent tutor with students assigned to
use traditional curricula, a researcher looking to estimate
average effects for high-hint users might model post-test scores
for subjects in the control condition as a mixture of two
distributions: one for students who would use many hints, and
one for students who would not. The success of this approach
depends on the fit of the model (misspecified models
may yield misleading results), so extensive model checking
is necessary. Further, even when the model is correctly
specified, its success can depend on factors beyond the re- 
searcher’s control [1]. 
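
The mixture-model idea described above can be sketched in a few lines. The example below is a deliberately simplified illustration written in Python for compactness (the tutorial itself uses R and JAGS), and it is not the tutorial's method: it assumes two strata, normal outcomes, no covariates, and a high-hint proportion `p_high` that is treated as known (in a randomized experiment it could be estimated from the treatment group, where hint usage is observed). It fits the control-group mixture by expectation-maximization, a technique swapped in here for brevity in place of the Bayesian fitting the tutorial covers. All function names and the simulated data are hypothetical.

```python
import math
import random


def normal_pdf(y, mu, sd):
    """Density of a normal distribution with mean mu and std. dev. sd."""
    return math.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_control_mixture(ys, p_high, mu_high, mu_low, steps=100):
    """Fit a two-component normal mixture by EM with a fixed weight p_high.

    Returns (log_likelihood, mu_high, mu_low).
    """
    n = len(ys)
    mean = sum(ys) / n
    sd_high = sd_low = math.sqrt(sum((y - mean) ** 2 for y in ys) / n)
    for _ in range(steps):
        # E-step: probability that each control subject is a high-hint type.
        resp = []
        for y in ys:
            a = p_high * normal_pdf(y, mu_high, sd_high)
            b = (1 - p_high) * normal_pdf(y, mu_low, sd_low)
            resp.append(a / (a + b))
        # M-step: responsibility-weighted means and standard deviations.
        w_high = sum(resp)
        w_low = n - w_high
        mu_high = sum(r * y for r, y in zip(resp, ys)) / w_high
        mu_low = sum((1 - r) * y for r, y in zip(resp, ys)) / w_low
        var_high = sum(r * (y - mu_high) ** 2 for r, y in zip(resp, ys)) / w_high
        var_low = sum((1 - r) * (y - mu_low) ** 2 for r, y in zip(resp, ys)) / w_low
        sd_high = max(math.sqrt(var_high), 1e-6)
        sd_low = max(math.sqrt(var_low), 1e-6)
    loglik = sum(
        math.log(p_high * normal_pdf(y, mu_high, sd_high)
                 + (1 - p_high) * normal_pdf(y, mu_low, sd_low))
        for y in ys
    )
    return loglik, mu_high, mu_low


def principal_effect_high(treated_high, control, p_high):
    """Mean of observed high-hint treated users minus the estimated mean of
    the would-be high-hint stratum in the control group."""
    lo, hi = min(control), max(control)
    # Try both orderings of the starting means; keep the better-fitting one.
    fits = [fit_control_mixture(control, p_high, lo, hi),
            fit_control_mixture(control, p_high, hi, lo)]
    _, mu_high_control, _ = max(fits)
    return sum(treated_high) / len(treated_high) - mu_high_control


# Simulated post-test scores (hypothetical): the true effect for the
# high-hint stratum is 70 - 60 = 10 points.
random.seed(0)
p_high = 0.3
control = [random.gauss(60, 5) if random.random() < p_high else random.gauss(80, 5)
           for _ in range(1000)]
treated_high = [random.gauss(70, 5) for _ in range(300)]
effect = principal_effect_high(treated_high, control, p_high)
```

Because the two strata are well separated in this simulation, the estimate lands close to the true effect; with overlapping or misspecified outcome distributions it can be badly off, which is the reason the text stresses model checking.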


Two other approaches depend less on modeling assumptions,
but may yield less precise estimates. One approach [3]
estimates bounds for principal effects, rather than estimat- 
ing the effects themselves. Another [4], applicable in some 
PS studies but not others, uses non-parametric techniques to 
identify plausible candidates for unobserved principal strata, 
and estimates effects based on those. These approaches are 
more “automatic” than the model-based approach, in that 
they do not require careful model fitting and checking, but 
still require researchers to specify the problem carefully. 


1.3 My Expertise


For the past three years, I have been working on an NSF- 
funded project to use the PS framework to study data from 
the Cognitive Tutor Algebra I effectiveness study. With Dr. 
John Pane of the RAND Corporation, I have estimated var- 
ious associations between Cognitive Tutor treatment effects 
and student usage. This has produced two EDM proceedings 
papers, [5] and [6]. As part of the project, I have developed 
a new method for estimating principal effects which expands 
on [4] and a set of new diagnostic and model checking techniques.
I have also worked extensively with Neil Heffernan’s
lab using PS to model data from ASSISTments experiments. 


2. TUTORIAL PLAN 


2.1 Introduction to Principal Stratification 
The beginning of the tutorial will introduce the PS framework.
First, we will discuss why principal stratification is
necessary: participants will learn to distinguish post-treatment
from pre-treatment variables and understand the conceptual
and methodological issues with conditioning causal inference
on post-treatment variables. Next, we will describe
the PS framework, so participants understand how it solves the
problems with post-treatment conditioning. Finally, we will
discuss methods for estimating effects within principal strata:
what assumptions they depend on and the source of their
identification. We will give a brief overview of the various
PS methods that we will explore hands on, in more depth,
during the remainder of the tutorial.


2.2 Hands on PS Estimation 


The second half of the tutorial will focus on three classes of
methods to estimate principal effects: nonparametric bounds,
nonparametric randomization inference, and model based
PS.


I will provide two real EDM datasets that participants can
use for exercises. The first will be a subset of the data from
the Cognitive Tutor effectiveness study, comparing subjects
assigned to use the Cognitive Tutor to those assigned to
traditional curricula. The study produced rich log data;
PS can be used to compare treatment effects between sets of
learners who used, or would have used, the tutor differently.
The second dataset will come from an experiment run on
the ASSISTments platform [7]. I will also give participants
the opportunity to bring their own datasets to the tutorial.


The methods will be taught in R, a free, open-source lan- 
guage for statistical computing. We will begin with a brief 
introduction to the software: how to read in data, and how 
to write and execute simple code. 


The bounding portion will be based on [3], which describes a 
set of bounds on principal effects, depending on available co- 
variates and certain identification assumptions. We will set 
out a number of real or realistic data scenarios and discuss 
which bounds may be appropriate when. Next, we will use 
R to calculate the appropriate bounds for principal effects. 


The randomization inference portion will be based on [4] and 
extensions I have developed. They depend on the assump- 
tion of monotonicity—that principal stratum membership is 
directly observable for all members of either the treatment 
or the control group. I will provide code in R to estimate 
confidence intervals for principal effects with and without 
covariates that predict stratum membership.


The model based portion will use Bayesian methods, with 
the JAGS language, via R and the R2jags package. We will
practice estimating principal effects with pre-written JAGS 
code (which I will explain) as well as discuss diagnostic tools: 
model checking, convergence diagnostics, and small simulation
studies.


ABSTRACT


With the growing trend of Active Learning, group work is
becoming increasingly common across all levels of education.
Despite the many advantages of group work, we have also
witnessed how difficult it is for teachers to keep an eye on the
activities within each group, which turns the group work
process itself into a black box from the teachers’ perspective.
To address this problem, this study
introduces Whitebox, a device that discreetly gathers several
types of data during group work, which are then visualized
for the teacher to reference after the group work. A user
study with high school students showed that group work
analysis by Whitebox led to a deeper understanding of how
each student performed within their group.


1. INTRODUCTION 


Considering the fact that there can be more than 30 stu- 
dents in a typical high school class in Japan, it is highly 
difficult for teachers to look over the activities within each 
group during group work. In other words, the process of the
students’ group work remains a black box for teachers.
In addition, how we evaluate group work is still an often de- 
bated issue, especially in formal education where a standard 
evaluation method is required. Whitebox was developed in 
order to suggest a solution towards such obstacles for schools 
in adopting group work. By placing the Whitebox in the 
middle of a group work table, it tracks the activities within 
the group. Later the recorded data will be visualized for 
the teacher to check, enabling teachers to get a rough idea 
of what kind of process each group went through without 
being physically present all the time. Furthermore, White- 
box quantifies the group work process by measuring talking 
ratios, volumes, etc., suggesting novel evaluation measure- 
ment units for group work, which can be used as the future 
standard. 


2. LITERATURE REVIEW


While much EDM / LA research has been limited
to online or digital learning environments, recent studies
have stepped into face-to-face classroom activities with the
help of advanced sensors and devices. Martinez-Maldonado
et al. [1] created a real-time feedback system that helps teachers
provide feedback at just the right time, using data
obtained from MTClassroom, a multi-touch tabletop that
analyzes the strategies of student groups. Evans et al. [2]
also proposed identifying students’ touch patterns on an
interactive tabletop to analyze the quality of collaboration.
Whitebox aims to provide similar feedback to teachers
without relying heavily on such hardware. In terms of
providing measurement units for conversation and collaboration,
Lederman et al. [3] proposed Open Badges, an open
source toolkit to measure face-to-face interaction and human
engagement in real time with custom hardware. Olguin et
al. [4] state that such sociometric badges can make group
collaboration more efficient by providing context, but such
badges are mainly used in business and work environments,
and they must be designed alongside students and teachers
if they are to be used in a classroom setting.


3. SYSTEM DESCRIPTION 


Initially, Whitebox used Kinect’s mic arrays to determine
which direction the audio is coming from, thereby distinguishing
who is currently speaking. Following feedback
from a pilot test, however, audio was also recorded
with separate pin microphones attached to the students’
clothing. The captured audio is also processed to obtain its
volume. Using Kinect’s depth camera, Whitebox also
obtains the participants’ body skeletons, allowing it to track 
their hand coordinates and their posture angles. Due to the 
way the current system is designed, Whitebox can only track 
the participants’ data when they are sitting down and are 
not moving around or switching positions. The entire group 
work is also recorded, and when the group work is finished 
the audio data is converted into text using Google Cloud 
Speech API. 


4. USER STUDY 


A user study was conducted during a 4-day Design Thinking
workshop at Tokyo Metropolitan College of Industrial Technology
high school. In this user study, we especially focused
on one group of four students, student A, B, C and D, and 
recorded only those 4 students’ activities. After the work- 
shop, the 4 visualizations and speech-to-texts were shown to 
both teachers and students separately, followed by an hour 
long discussion each on what those data meant to them.
Figure 1 shows the study setup.


Figure 1: User study setup 


The data acquired from the 4 workshops was processed, then
visualized into A4 infographic posters as shown in Figure 2.


Figure 2: visualizations from user study


To provide a more fine grained analysis of each session, 
we also provided additional visualization that plotted the 
students’ audio data, posture data and hand position data 
along the timeline of the workshop. Figure 3 is an example
of the additional visualization.


Figure 3: additional visualization of student D from day 3


By visualizing the data from all four sessions, it was possible 
to get a grasp of how each student behaved in the workshops. 
It is important to note here that what the visualizations sug- 
gested matched with the thoughts of facilitators who were 
in charge of this group (e.g. that student D would speak 
the least and student B would take charge of the overall 
discussion), meaning that Whitebox would be able to as- 
sist teachers to evaluate group work without them having 
to be present at each group’s table all the time. As for
the speech-to-text, it helped the teachers to see what words
were mentioned most frequently. With improved conversion
accuracy, it would become possible to process the text to
search for the conjunctional phrases most frequently used
by each student in order to see the characteristics of their
contributions.


By post-processing the recorded audio data, we were also
able to provide visualizations of the order of conversational
turn taking during the discussion. The data was plotted for
each 30-second window of conversation. This enables the teacher
to examine specific points in a discussion and analyse how
it transitioned between the group members. An example is
shown in Figure 4.


Figure 4: conversation transition
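The turn-taking analysis described above can be illustrated with a minimal sketch. It assumes that the audio has already been segmented into speaker-labelled utterances; the `segments` format, the function name, and the sample data are hypothetical and are not taken from the Whitebox implementation.

```python
from collections import Counter


def turn_transitions(segments, window=30.0):
    """Count speaker-to-speaker transitions per time window.

    segments: list of (start_seconds, speaker) tuples sorted by start time
              (an assumed output format of a speaker-labelling step).
    Returns {window_index: Counter({(from_speaker, to_speaker): count})}.
    """
    result = {}
    for (_, prev), (start, cur) in zip(segments, segments[1:]):
        if prev == cur:
            continue  # same speaker keeps the floor: not a transition
        idx = int(start // window)  # window in which the new turn starts
        result.setdefault(idx, Counter())[(prev, cur)] += 1
    return result


# Made-up example data for four students A-D.
segments = [(0.0, "A"), (4.5, "B"), (12.0, "B"),
            (20.0, "D"), (31.0, "B"), (40.0, "A")]
transitions = turn_transitions(segments)
```

Each window's counter can then be drawn as a small directed graph between group members, which is one possible way to obtain a picture similar to Figure 4.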


5. CONCLUSION 


In this study, we proposed Whitebox, a device that tracks the
activities within group work. Through the discussions with
the teachers, we were able to see that Whitebox analysis cer- 
tainly functioned as a guideline for a deeper understanding 
of the group and its students, and it also functioned as signs 
for what was and was not working in the group work, ulti- 
mately leading to improvements in the design of the class. 
Although not all the data we recorded seemed useful to the 
teachers, the measurements that Whitebox proposed, espe- 
cially talking ratios, volumes and posture were valuable in- 
formation for the teachers, uncovering the activities within 
the group that they otherwise would have missed. If used
continuously, these measurements can become a standard
unit for assessing group work.
ABSTRACT


Paper-based assessment is still one of the most preferred methods 
in assessing students in a blended learning environment. However, 
it has several drawbacks such as having a high turnaround time 
before feedback is provided to the students. Furthermore, 
understanding how students attend to their graded papers is difficult
to investigate because of the absence of empirical evidence. We 
describe in this paper a web-based system we developed that 
addresses some key issues when trying to understand the reviewing 
and reflection behaviors of the students. This system also aims to 
help instructors to efficiently and effectively grade paper-based 
assessments. 


Keywords 


Reviewing Behavior, Paper-Based Assessment, Educational 
Technology 


1. INTRODUCTION 


Paper-based assessment is still one of the most preferred methods 
in assessing students in a blended learning environment. Aside 
from being convenient to prepare, the possibility of students
committing academic dishonesty is lower. However, it also has its
drawbacks. Evaluating large numbers of test papers gives rise to the
possibility of inconsistency among or even within graders [2]. 
Additionally, the feedback is limited [5]. Moreover, there is a high 
turnaround time before students receive their graded papers [1]. In 
terms of understanding the reviewing and reflecting behaviors of 
the students, it is difficult to systematically estimate how students 
review their paper-based assessments because of the absence of 
empirical evidence. It is not possible to determine whether students 
really do review their graded test papers. Thus, it is challenging to 
estimate the impacts of reviewing on learning. 


2. WEB-BASED PROGRAMMING 
GRADING ASSISTANT (WPGA) 


A web-based system was developed to address the above-
mentioned issues. More specifically, it is designed to help students
review effectively. In addition, it aims to help instructors to
efficiently and effectively grade paper-based assessments. The
name of the system is Web-based Programming Grading Assistant
(WPGA). The system is capable of capturing all activities
performed by the users, which consist mostly of students’
clickstream data.


2.1 Documentation of Paper-Based Assessments

WPGA uses quick response (QR) codes to label the paper exam of 
a student. These generated codes are manually placed on the 
students’ papers prior to scanning. Using an automatic document 
feeder, all the papers are scanned and uploaded to the system. The 
system automatically associates the scanned image to the 
corresponding student and the corresponding assessment. In some
instances, the system may not accurately associate an image
with a student. One possible reason is that the QR code
is unreadable. It could also be because the student is not
registered in the system. When this happens, the instructor can
manually label the images.
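The association step and its manual fallback can be sketched as follows. This is only an illustration: the `student_id:assessment_id` payload layout, the function name, and the sample data are assumptions, and the actual QR decoding is left out.

```python
def associate_scan(payload, roster):
    """Map a decoded QR payload to (student_id, assessment_id).

    payload: decoded QR text, or None if the code could not be read.
             The "student_id:assessment_id" layout is an assumption.
    roster:  set of registered student ids.
    Returns None when the scan has to be labelled manually.
    """
    if not payload or ":" not in payload:
        return None  # unreadable or malformed QR code
    student_id, assessment_id = payload.split(":", 1)
    if student_id not in roster:
        return None  # student is not registered in the system
    return student_id, assessment_id


# Made-up example: one good scan, one unreadable code, one unknown student.
roster = {"s001", "s002"}
scans = {"page1.png": "s001:midterm1",
         "page2.png": None,
         "page3.png": "s999:midterm1"}

labelled, manual_queue = {}, []
for image, payload in scans.items():
    match = associate_scan(payload, roster)
    if match is None:
        manual_queue.append(image)  # left for the instructor to label
    else:
        labelled[image] = match
```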


2.2 Interface for Grading Assessments 

After the exams are digitized, instructors can distribute the
questions to be evaluated by different graders. The system allows
multiple graders to work on the same assessment simultaneously.
As a result, the turnaround time in the distribution of grades is
reduced. Grading consistency will also improve, since each grader
works only on the question assigned to them.


Figure 1. The grading interface of WPGA


The grading interface is shown in Figure 1. Each button in the upper
right portion represents a learning concept or a rubric that is used to
evaluate a question. Every rubric defaults to a perfect score, which
translates to a full understanding of the concept. Whenever the
button is clicked, the grade for the rubric is decremented and the
overall score is recalculated. Also, the color of the button changes
depending on the grade for the rubric. It could be blue (full
understanding), red (partial understanding), or grey (missed the
concept). The overall score can also be overridden, if necessary.
The graders can also add markings on top of the student’s paper.
This enables them to highlight the mistakes. Lastly, using the
comment section, the graders can provide free form feedback. In
previous studies [2,3], we found that graders prefer to type their
feedback rather than physically writing it on paper. One
advantage of this over the traditional way of checking is the ability
to copy and paste feedback for common and similar mistakes.
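The rubric buttons can be thought of as a small state machine. The sketch below is a hedged illustration, not the WPGA implementation: the credit assigned to each level, the wrap-around after "missed", and all names are assumptions, since the text does not state them.

```python
# Grade levels per rubric in click order, with the button colours named above.
LEVELS = [("full", "blue"), ("partial", "red"), ("missed", "grey")]
# Assumed credit per level; the actual weighting in WPGA is not stated.
CREDIT = {"full": 1.0, "partial": 0.5, "missed": 0.0}


class QuestionGrade:
    def __init__(self, rubrics, max_score):
        self.state = {r: 0 for r in rubrics}  # every rubric starts at "full"
        self.max_score = max_score
        self.override = None  # set to a number to override the computed score

    def click(self, rubric):
        """Decrement the rubric by one level (wrapping around is an assumption)."""
        self.state[rubric] = (self.state[rubric] + 1) % len(LEVELS)

    def color(self, rubric):
        return LEVELS[self.state[rubric]][1]

    def score(self):
        """Overall score, recalculated from the rubrics unless overridden."""
        if self.override is not None:
            return self.override
        credits = [CREDIT[LEVELS[i][0]] for i in self.state.values()]
        return self.max_score * sum(credits) / len(credits)


grade = QuestionGrade(["loops", "arrays"], max_score=10)
grade.click("loops")  # "loops" drops from full to partial understanding
```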


2.3 Interface to Encourage Student Reflection 
After the instructor publishes the results of an assessment, the
students can log in to the system and review it. Students can view
the results at two levels: assessment level and
question level. At the assessment level (shown in Figure 2), the
general result is displayed. This includes the overall score obtained
by the student along with the individual scores for each question.
At the question level (shown in Figure 3), detailed feedback for
the particular question is provided. This includes the scores for all
the rubrics, the markings on the student’s paper, and the free form
text provided by the grader.


Figure 3. The question level view of the student interface


In addition to letting the students access a digital copy of their paper 
assessments, the system also allows them to reflect on the feedback 
given to them by the graders. We incorporated some features that 
help students track and monitor their learning. For example, at the
question level, there is a checkbox that the students can tick to
indicate whether they already know how to solve the problem after
reviewing it. This is particularly useful for questions where they
made mistakes. Another feature is the bookmark, which
enables students to highlight the importance of a question. This
could be used in future targeted reviews along with the use of
filters. We also provided a free form text area to allow students
to type in their personal notes. The collection of these
bookmarks, checkbox ticks, and notes is an externalization of what
the student knows. Through these features, it is hoped that students
will be encouraged to reflect on their answers.


3. CASE STUDY 


Using the system, we designed a classroom study and analyzed the 
logs collected from an Object-Oriented Programming and Data 
Structures class. We tracked and modeled students’ reviewing and 
reflecting behaviors. Results show that students demonstrated an
effort and desire to review assessments regardless of whether they are
graded or not [4].


4. FUTURE WORK 


We intend to improve the system by using the feedback obtained 
from the users. For the next iteration, we are integrating the 
analytics module that will enable the instructors to quickly see a 
snapshot of the class performance and will enable them to gain
insight into the assessments they gave to the students. Furthermore,
we intend to do more research in understanding the reviewing 
behaviors of the students. This would allow us to create 
personalized review sessions that will help students do effective 
reviews. 


5. REFERENCES 


[1] Susan A. Ambrose, Michael W. Bridges, Michele DiPietro, 
Marsha C. Lovett, and Marie K. Norman, How Learning 
Works: Seven Research-Based Principles for Smart 
Teaching. John Wiley & Sons, 16 April 2010.


[2] I.-Han Hsiao, "Mobile Grading Paper-Based Programming 
Exams: Automatic Semantic Partial Credit Assignment 
Approach," in Lecture Notes in Computer Science, 2016, pp.
110-123. 


[3] I.-Han Hsiao, Sesha Kumar Pandhalkudi Govindarajan, and 
Yi-Ling Lin, "Semantic visual analytics for today's
programming courses," in Proceedings of the Sixth 
International Conference on Learning Analytics & 
Knowledge (LAK'16), 2016. 


[4] I.-Han Hsiao, Po-Kai Huang, and Hannah Murphy,
"Uncovering reviewing and reflecting behaviors from paper-
based formal assessment," in Proceedings of the Seventh
International Learning Analytics & Knowledge Conference,
2017, pp. 319-328.


[5] Hannah E. Murphy, "Digitalizing Paper-Based Exams: An 
Assessment of Programming Grading Assistant," in 
Proceedings of the 2017 ACM SIGCSE Technical 
Symposium on Computer Science Education, New York, NY, 
USA, 2017, pp. 775-776. [Online]. 
http://doi.acm.org/10.1145/3017680.3022448 




Doctoral Consortium 


A Framework for the Estimation of Students’ Programming 
Abilities 


Ella Albrecht 
Institute of Computer Science 
University of Goettingen 
Göttingen, Germany


ella.albrecht@cs.uni-goettingen.de


ABSTRACT 


In times of increasing numbers of students and high usage 
of e-learning systems, student models are a good way to 
get an overview of what is currently occurring in the class- 
room, analyze students’ behavior and estimate their learn- 
ing progress. In our work, we develop a framework which 
estimates a student’s programming knowledge by looking
at their responses to open-ended programming assignments.
The model we construct incorporates multiple applications 
of multiple skills in one exercise, multiple submissions and 
varying knowledge components involved in the same exer- 
cise. 


1. INTRODUCTION 


In recent years, the number of students has increased
rapidly. Introductory courses in particular are attended by
hundreds of students. This makes it infeasible for educators
to take care of each student individually. On the other hand,
to deal with large numbers of students, many institutes use
e-learning and e-assessment systems to support their teaching.
These systems allow large-scale data collection, to which data
mining and learning analytics techniques can be applied to
build student models. Student models are used to estimate
a student’s cognitive state, e.g., his/her motivation, knowledge,
misconceptions or learning style and preferences [4].
A student model can be used to provide students with personalized
course material fitting their current knowledge and
learning habits. Furthermore, student models can be used to
predict students’ performance and to identify students who
are at risk in order to intervene in a timely manner. In addition, we can
use a student model to identify problematic course contents.
This knowledge can be used as a basis for restructuring and
redesigning the course.

In our research, we want to develop a framework for the
estimation of students’ knowledge regarding programming.
To this end, we look at students’ solutions to open-ended programming
exercises. For each exercise, it is defined which
knowledge components (KCs) are required to solve the exercise
correctly. KCs describe the individual components of
knowledge which are required to solve a particular task or
problem. The task in an introductory programming course
is to learn to write simple programs which meet the specifications
given in text form, i.e., the exercise description.
Therefore, KCs can be, e.g., the programming language’s
constructs, i.e., syntax and semantics, correct usage of a
compiler or IDE, error understanding and debugging ability,
or the translation of specifications to program code. Then,
it is checked whether the student has applied the KCs in
his/her solution correctly. From these observations, a student
model can be constructed which is able to estimate a
student’s knowledge state.


2. PROBLEM STATEMENT 


Knowledge cannot be assessed directly, because there may 
be several reasons why a student made a mistake. For example,
a missing break in a switch-case block may simply be due
to sloppiness, or it may occur because the student does not know the break
statement or does not understand how
the commands in a switch-case block are executed. Because
of these uncertainties, probabilistic models are often
used for student modeling.

Bayesian Knowledge Tracing (BKT) [5] is one of the most 
widely spread student modeling approaches. It uses Hidden 
Markov Models to model students’ learning. It was at first 
applied to programming exercises for LISP in the ACT Pro- 
gramming Tutor. The domain knowledge was represented 
by production rules of the form "to achieve goal X do Y",
where Y may be a subgoal. The knowledge of a student was
described as the probability that the student knows a rule.
Since there was a deterministic order in which rules needed to
be applied to solve an exercise correctly, the student’s knowledge
could be estimated by looking at the order of rules in the student’s
solution. But in imperative or object-oriented languages
like C, C++, or Java, one can only very rarely define a
deterministic order of statements.
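For readers unfamiliar with BKT, the standard single-skill update can be sketched as follows. The parameter values are made up for illustration; the function applies Bayes' rule to the observation and then the learning transition.

```python
def bkt_update(p_know, correct, guess, slip, learn):
    """One Bayesian Knowledge Tracing step for a single skill.

    p_know: prior probability that the student knows the skill
    guess:  P(correct answer | skill not known)
    slip:   P(incorrect answer | skill known)
    learn:  P(skill becomes known after this opportunity)
    """
    if correct:
        evidence = p_know * (1 - slip)
        posterior = evidence / (evidence + (1 - p_know) * guess)
    else:
        evidence = p_know * slip
        posterior = evidence / (evidence + (1 - p_know) * (1 - guess))
    # Learning transition: an unknown skill may become known.
    return posterior + (1 - posterior) * learn


# Illustrative parameter values only.
p = 0.3
for observed_correct in [True, True, False, True]:
    p = bkt_update(p, observed_correct, guess=0.2, slip=0.1, learn=0.15)
```

Note that the update tracks exactly one skill per observation, which is the limitation discussed in the remainder of this section.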

Kasurinen and Nikula [7] have applied BKT to students’ results
on Python exercises. As domain knowledge they have
defined guidelines for preferred solutions, e.g., each open file
should be closed. They then checked whether the
student has used the guideline in his/her solution. However,
the set of KCs was very limited.

Berges and Hubwieser [2] as well as Yudelson et al. [10]
used the Rasch model from Item Response Theory (IRT) to
estimate students’ knowledge of object-oriented concepts in
Java instead. In IRT, the relationship between responses to
items, i.e., exercises, and a latent trait, i.e., an ability or
KC, is described as a logistic function. Unlike BKT,
it also takes the difficulty of an item into account.
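As a brief illustration, the Rasch model gives the probability of a correct response as a logistic function of the difference between the student's ability and the item's difficulty. The function name and example values below are illustrative only.

```python
import math


def rasch_probability(ability, difficulty):
    """P(correct) under the Rasch (one-parameter logistic) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# A student whose ability equals the item difficulty succeeds half the time.
p_equal = rasch_probability(0.0, 0.0)
# The same student is less likely to solve a harder item.
p_harder = rasch_probability(0.0, 1.0)
```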

BKT as well as IRT have the main drawback that they are
single-skill models, i.e., for each KC a separate model is constructed,
and it is assumed that each exercise only requires
one KC. For programming assignments, this assumption is
not tenable. Performance Factor Analysis (PFA)
[8] is able to deal with multiple skills per exercise but, like BKT
and IRT, does not consider dependencies between KCs.
However, in the programming domain there are dependencies
between KCs, e.g., one needs to know how assignment
or incrementing works when using a for-loop, and
knowledge of a while-loop can influence knowledge of a
for-loop. It has also been shown that integrating dependencies of
knowledge into a student model can improve the model [3,
6]. Another special property of programming assignments
is that KCs can be required multiple times in one exercise,
e.g., if multiple loops are needed to solve the exercise. We
also want to investigate the influence of substeps during the
solution process on a model’s accuracy. To the best of our
knowledge, there is no modeling approach so far
which fulfills all of the requirements for programming assignments
we have stated above.
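To make the contrast with the single-skill models concrete, the following sketch shows the usual PFA prediction, in which the contributions of all KCs required by an exercise are summed in the logit. The parameter values and KC names are made up; note that each KC contributes independently, so dependencies between KCs are not represented.

```python
import math


def pfa_probability(kcs, params, history):
    """P(correct) under Performance Factor Analysis.

    kcs:     KCs required by the exercise
    params:  {kc: (beta, gamma, rho)} = easiness, success weight, failure weight
    history: {kc: (prior_successes, prior_failures)} for the student
    """
    logit = 0.0
    for kc in kcs:
        beta, gamma, rho = params[kc]
        successes, failures = history.get(kc, (0, 0))
        logit += beta + gamma * successes + rho * failures
    return 1.0 / (1.0 + math.exp(-logit))


# Illustrative values only.
params = {"for-loop": (-0.5, 0.2, 0.1), "array": (0.5, 0.1, -0.1)}
history = {"for-loop": (2, 1)}  # no prior practice on "array"
p_correct = pfa_probability(["for-loop", "array"], params, history)
```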


3. RESEARCH METHODOLOGY AND APPROACH


Before we can make use of a student model in a course, 
several steps have to be taken. First, we need to identify 
what we expect the students to learn in our course, i.e., 
which KCs shall be acquired. In the first iteration of our 
research, the KCs we want to use for our model are the con- 
cepts of the programming language, e.g., if, for, variables, 
arrays etc., rules for good programming practice, e.g., each 
declared variable shall be used, allocated memory has to be 
freed, etc., as well as the fulfillment of the specifications by 
checking whether the program produces the correct output. 
In a second step, we need to know which KCs are required 
to solve a particular exercise as we want to build our stu- 
dent model from the data we gain from their solutions to 
programming assignments. For example, summing up the
numbers from 1 to 100 requires, among others, knowledge
of loops or recursion. This example also shows that
it is not easy to define which concrete concepts
are really mandatory to solve the exercise, as we could write
a correct solution without knowing loops if we know recursion,
and vice versa. In our work, we develop a knowledge
requirements model (KRM) which models required KCs re- 
lated to language concepts for a particular exercise. The 
general mapping of language constructs, e.g., elements of an 
abstract syntax tree (AST), to concrete KCs has to be done 
beforehand by a domain expert. The KRM for a particular 
exercise is learned automatically from different correct solutions
to that exercise based on their ASTs and structural
analysis. We divide correct solutions into blocks and determine
the set of KCs used in each block. From these sets we
construct a tree where each path describes an alternative so- 
lution. By comparing a student’s solution to the KRM, one 
can get the KCs which were applied correctly, incorrectly or 
are missing in the student’s solution. 
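The idea of mapping AST elements to KCs and comparing a student's solution with alternative correct solutions can be illustrated with a strongly simplified sketch. It uses Python's `ast` module as a stand-in for a C parser, treats each correct solution as one flat KC set instead of a tree of blocks, and only distinguishes applied from missing KCs; the node-to-KC mapping and all names are hypothetical.

```python
import ast

# Hypothetical expert-defined mapping from AST node types to KCs.
NODE_TO_KC = {
    ast.For: "for-loop", ast.While: "while-loop", ast.If: "if",
    ast.Assign: "assignment", ast.AugAssign: "assignment",
    ast.FunctionDef: "function", ast.Return: "return",
}


def used_kcs(source):
    """Set of KCs whose constructs occur anywhere in the source code."""
    nodes = ast.walk(ast.parse(source))
    return {NODE_TO_KC[type(n)] for n in nodes if type(n) in NODE_TO_KC}


def compare(student_kcs, alternatives):
    """Pick the closest alternative solution and report applied/missing KCs."""
    best = max(alternatives, key=lambda alt: len(alt & student_kcs))
    return {"applied": best & student_kcs, "missing": best - student_kcs}


# Two correct solutions for summing the numbers from 1 to 100.
loop_solution = "total = 0\nfor i in range(1, 101):\n    total += i\n"
recursive_solution = (
    "def s(n):\n    if n == 0:\n        return 0\n    return n + s(n - 1)\n"
)
alternatives = [used_kcs(loop_solution), used_kcs(recursive_solution)]

# An incomplete student attempt that never loops.
student = "total = 0\ni = 1\ntotal = total + i\n"
report = compare(used_kcs(student), alternatives)
```

In this example the attempt is closest to the loop-based alternative, so the missing KC is reported as the for-loop rather than recursion.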

Besides the comparison with the KRM, we also use compiler
and static analysis tool messages to assess the incorrect application
of a KC; for example, static analysis tools can deliver hints
on a misunderstanding of control flow. Dynamic tests
such as unit tests help us to evaluate a student’s general program
writing ability, i.e., whether a student is able to write
a program which meets the specifications, i.e., does what it
is intended to do.


Figure 1: Example structure for a part of a DBN student model

The third step deals with the construction of the student 
model. We use Dynamic Bayesian Networks (DBNs) for student
modeling as they seem most appropriate to us. A DBN
is a two-time-sliced Bayesian network where the state of a
hidden variable depends on the states of its parent variables
and on its own state in the previous time step.
Making observations in each time step updates the proba- 
bility distribution of a hidden variable being in a particular 
state. 

In our case, the hidden variables are the KCs, e.g., in Fig- 
ure 1 the hidden variables (blank circles) are the concepts 
types, variables, and assignments. Observations in our stu- 
dent model are the results from the comparison of the stu- 
dent’s solution with the KRM, compiler and static analysis 
tool messages as well as results from dynamic tests, e.g., 
in Figure 1 the observations (filled circles) are whether the 
student has declared and initialized a variable as well as 
whether an error message regarding incompatible types in 
an assignment appears. These variables can have the states
true or false. With DBNs, we are able to deal with multiple 
KCs per exercise, their interdependencies, the uncertainty 
of which KC is affected by a certain observation and the 
uncertainty of which KCs are required to solve a particular 
exercise. 

In our work, the structure of the DBN is defined manually
by a domain expert, although one could also learn dependencies
between KCs from data. The parameters of the DBN
are learned from data using an expectation maximization
algorithm with reasonable parameter constraints defined by
an expert, e.g., limits for guess and slip probabilities. One
problem that may occur is that, with a very fine-grained
KC definition, the parameter space becomes too
large and estimating the parameters of the model becomes
computationally problematic. Therefore, we need to evaluate which granularity
to choose to be able to estimate the parameters and
still have an accurate model. Furthermore, we have to decide
how to integrate multiple occurrences of the same KC
in one exercise. Possible treatments are, e.g., majority vote
or using uncertain evidence with a probability according
to the ratio of correct/incorrect applications. We also want
to analyze whether multiple submissions, i.e., substeps preceding
the final solution, improve the model.
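The two treatments for multiple occurrences of the same KC can be sketched as follows. This is a minimal illustration with made-up data; the tie-breaking rule in the majority vote is an assumption, and how the resulting value is fed into the DBN as evidence is not shown.

```python
def majority_vote(applications):
    """Collapse several applications of one KC into a single observation.

    applications: list of booleans, True = KC applied correctly.
    Ties are counted as incorrect here, which is an assumption.
    """
    correct = sum(applications)
    return correct > len(applications) - correct


def soft_evidence(applications):
    """Probability of the 'correct' observation as the ratio of correct uses."""
    return sum(applications) / len(applications)


# Example: three loops in one exercise, two of them written correctly.
loops = [True, True, False]
hard_observation = majority_vote(loops)  # single true/false observation
soft_observation = soft_evidence(loops)  # uncertain evidence of 2/3
```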

In the second iteration of our research, we want to add further
KCs which concentrate on more cognitive skills. The
first one is the debugging ability, which we want to assess
by comparing two subsequent submissions when the first one
indicates an error (or a failure) and checking whether the problem
was fixed.

As a further KC, we want to include variable roles [9]. Vari- 
able roles describe patterns of variable usage. They are de- 
fined by the successive values the variables obtain. An ex- 
ample for a role would be the most-wanted holder which is 
a variable that holds the best value encountered so far when 
going through a succession of values, e.g., when searching 
the smallest value in an array. The proper collocation of 
variable roles is essential for solving a task or achieving a 
goal in a program. Usually, students intuitively use variable 
roles in their programs. The lack of knowledge of a particular
role could explain why a student may have problems
solving an exercise.
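The most-wanted holder role mentioned above can be illustrated with a short example that searches for the smallest value in an array. The example is written in Python for brevity, although the course discussed here uses C.

```python
def smallest(values):
    """Return the smallest element of a non-empty list."""
    smallest_so_far = values[0]      # most-wanted holder: best value seen so far
    for value in values[1:]:
        if value < smallest_so_far:  # a better candidate replaces the held value
            smallest_so_far = value
    return smallest_so_far


result = smallest([7, 3, 9, 1, 4])
```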

We want to evaluate our model by comparing it to common 
student modeling approaches like BKT, IRT and PFA. 

In a last step, we want to analyze the model constructed
from the data of our introductory C course to find out what
students who are at risk have in common, which KCs seem
most difficult for the students, and how many exercises are required
at least (on average, to reach a particular percentage
of students) to gain sufficient knowledge in a certain KC.


4. CURRENT STATUS & NEXT STEPS 


We have implemented a framework for the collection of met- 
rics regarding students’ solutions [1] which was successfully 
introduced in our introductory C programming course. It is 
mainly an e-assessment system where students can upload 
their solution and get some basic feedback. It collects com- 
piler messages, results from static analysis tools, and results 
from dynamic tests to capture the correctness of the solution.
In the first year, we received about 10,000 submissions
from an average of 250 students. We expect similar numbers this
year.

Furthermore, we have identified the different KCs that we 
have in our course by going through the course material and 
previous programming errors of students. Based on that, 
we defined a hierarchical structure of KCs where the sinks
are basic observations in the form of rules, e.g., the function
returns a value if the return type is not void. We have also
mapped compiler/static analysis tool messages to different
concepts and implemented an AST parser. As a next step,
we want to use the AST to filter the KCs from source code 
and construct our KRM. 

Next, we plan to conduct a small case study with only a few 
KCs to evaluate the feasibility of our DBN student model. 


5. EXPECTED CONTRIBUTIONS 


In our work, we develop a framework for the estimation of 
students’ knowledge regarding programming. One of our 
main contributions is the definition of a student model which 
has the following properties which are needed to construct 
the model based on solutions to programming assignments: 
multiple KCs per exercise are possible and their interde- 
pendencies are considered, uncertainty of affected KCs can 
be handled, individual KC requirements and usages can be 
treated, multiple submissions can be integrated, and a KC 
can be used multiple times in the same exercise. 

Another contribution will be a KRM which is automatically 
generated from model solutions for each exercise and can be 
used to evaluate which KCs were applied correctly or incorrectly
by the student.

Furthermore, we plan to look not just at language-related
KCs, but also at more cognitive skills such as debugging ability.
We hope that our model helps to get better insights into
the learning process of students.

From the doctoral consortium, we expect to get some feedback
on our student model, especially hints for the evaluation
w.r.t. metrics and data sets. We are also looking
forward to further ideas for additional or alternative KCs
which we can integrate into our model.
ABSTRACT


My research focuses on the integration of science and de- 
sign through the use of interactive simulations and other 
scaffolding tools. I specifically look at patterns of use in 
interactive simulations. To conduct this research, I have
developed a curriculum about solar ovens used by middle 
school students, during which students are guided by an 
online curriculum to design, build, and test physical solar 
ovens. This curriculum utilizes interactive simulations as a
tool to help students plan the design for their solar ovens. I 
have evaluated scaffolding for the simulation steps, and plan 
to evaluate other patterns of student use, based on action 
log data. 
1. RESEARCH TOPIC


My research focuses on the integration of science and design 
through the use of interactive simulations and other scaffold- 
ing tools. I specifically look at patterns of use in interactive 
simulations. I conduct this research in secondary schools, 
and work in collaboration with teachers. Through my dis- 
sertation work, I aim to answer the following questions: 


• What types of use patterns in interactive simulations
are beneficial for integrating science and design learning?


• How can we use tools to support integrated understanding
in writing activities (e.g., automated guidance)?


My work is situated in the learning sciences, using tech- 
niques from educational data mining and artificial intelli- 
gence to understand how students’ activities impact their 
learning and how to improve the learning experience. Recently,
I have used natural language processing to develop
automated classifiers for multiple short response questions
[6]. Using these classifiers, I plan to develop automated guidance
for student writing during the curriculum, which will
be deployed during spring 2017. I have also studied student use
of interactive simulations, using log data, feature engineer- 
ing, and clustering to make sense of patterns (submitted to 
EDM 2017). 


To conduct this research, I have developed a curriculum that 
is run using an online platform and offers students the op- 
portunity to use interactive simulations while they design a 
physical artifact. In previous work, I have found that the 
simulation is beneficial, especially when students use it dur- 
ing the design phase of the curriculum [8]. My work has also 
been published in a variety of other conference venues [7, 10, 
11, 9]. 


1.1 Curriculum 

My research utilizes a curriculum about solar ovens that 
is run using the Web-based Inquiry Science Environment 
(WISE). During this curriculum, students design, build, and 
test a solar oven. They go through the design, build, test 
process two times to get an idea of how engineers iterate 
on their designs based on results from testing (Figure 1). 
This curriculum was designed using the knowledge integra- 
tion framework [5]. The knowledge integration framework 
has proven useful for design of instruction featuring dynamic 
visualizations [14] and engineering design [1, 12]. The frame- 
work emphasizes linking of ideas by eliciting all the ideas 
students think are important and engaging them in testing 
and refining their ideas [5]. 


Students are allowed to use only a certain set of materials 
(e.g., tin foil, black construction paper, plastic wrap, Plexi- 
glas, tape), in addition to a cardboard box they bring from 
home. Students use an interactive computer simulation to 
test the different materials in their oven. This simulation 
helps to elicit student ideas before they get to the building 
process, consistent with the knowledge integration frame- 
work. The testing portion of the project allows students to 
distinguish their ideas. 


Throughout the project, students respond to short response 
questions about the choices they are making in their design 
and how their ovens work. This curriculum is unique in that
it is guided by an online platform while students also design,
build, and test their solar ovens in a hands-on portion of the
project.


Figure 1: Outline of the solar ovens curriculum


The curriculum takes between 10 and 15 class periods (45 minutes
per class period). Students complete this project in
groups of 2 or 3 students. Students also complete a pretest 
the day before the project begins and a posttest the day after 
completing the solar ovens project. Students do the pretest 
and posttest individually. The pre-/posttests measure stu- 
dent understanding of science concepts and practices. 


1.2 Interactive Computer Simulation 

The interactive simulation (Figure 2) was built using NetLogo
[15]. Students can manipulate the simulation in a number
of ways. They can change the cover on top of the oven,
whether or not there is a reflective flap on top of the box, 
the shape of the box (wide and short or skinny and tall), and 
the albedo (reflectivity) of the inside of the box. Students 
may also manipulate the speed at which the simulation runs. 
Once a simulation runs to the end of the graph (10 simu- 
lated minutes), a new row is added to the table below the 
visualization with the settings and results from the trial. If 
the students do not allow the simulation to run until the 
simulated 10 minutes finish, nothing is added to the table. 


The scaffolds we developed for the interactive simulation are
twofold: short response questions direct students to investigate
capabilities and limitations of the simulation, and an
automatically generated table helps students to keep track
of trials they have run. The table includes information about
all of the settings used in each trial, as well as the results of
the trial at certain time points (e.g., 5 minutes, 10 minutes).


2. PROPOSED CONTRIBUTIONS 


Making sure students use interactive simulations to aid in 
learning is a difficult task. To try to encourage students to 
take advantage of these simulations during learning, various 
scaffolding methods have been used. Often, these scaffolds 
are implicit, or built into the system with the simulation 
[13]. For example, guiding questions are used with inquiry 
simulations to direct students’ attention toward certain fea- 
tures of simulations [4]. Students are also often encouraged 
in science classes to run multiple trials and control variables 
between trials (only change one variable between trials). A 
control of variables strategy can help students to determine 
the effect of a single variable on a more complex system, 
although in some cases students may benefit from more ex- 
ploratory strategies [12]. 

[Figure 2 image: screenshot of the interactive simulation. The table below it lists, for each trial, the box shape, cover type, flap, interior albedo, and temperature at four time points from 0 to 10 simulated minutes.]


Figure 2: The interactive simulation used by students
to test solar ovens and visualize energy transformation;
below the simulation is output from
the automatically generated table


Using log files from student interactions with the curriculum 
and output from the automatically generated tables (simula- 
tion scaffolding), we use feature engineering to identify how 
students use the model and whether these uses have an im- 
pact on learning. I developed features that have to do with 
the control of variables strategy, such as the number of trials 
(rows) a student runs and the percent of those trials that are 
systematic. These types of techniques have also been used 
with more complex simulations and microworlds (e.g., [8,
2]). We use results from pre- and posttests to assess student
learning in tandem with the log data from the curriculum. 
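
As a minimal sketch of the trial-based features described above, the following Python assumes that each table row is a dictionary of the four simulation settings and treats a trial as systematic when it differs from the immediately preceding trial in exactly one setting. That definition, the setting names, and the function names are illustrative assumptions, not the definitions used in the study.

```python
# Hypothetical setting names; the study's log format is not given in the text.
SETTINGS = ("cover", "flap", "shape", "albedo")


def changed_settings(prev, curr):
    """Return the settings whose values differ between two trials."""
    return [s for s in SETTINGS if prev[s] != curr[s]]


def trial_features(rows):
    """Compute the number of trials and the percent that are systematic.

    Assumption: a trial is systematic when it changes exactly one setting
    relative to the previous trial. The first trial has no predecessor
    and is never counted as systematic.
    """
    n_rows = len(rows)
    n_systematic = sum(
        1
        for prev, curr in zip(rows, rows[1:])
        if len(changed_settings(prev, curr)) == 1
    )
    percent = 100.0 * n_systematic / n_rows if n_rows else 0.0
    return {
        "n_rows": n_rows,
        "n_cov_trials": n_systematic,
        "percent_systematic": percent,
    }


# Three made-up trials: the second changes only the shape,
# the third changes both the cover and the flap.
rows = [
    {"cover": "none", "flap": False, "shape": "wide", "albedo": 50},
    {"cover": "none", "flap": False, "shape": "skinny", "albedo": 50},
    {"cover": "plexiglass", "flap": True, "shape": "skinny", "albedo": 50},
]
features = trial_features(rows)
```

Other definitions are possible, for example comparing each trial against every earlier trial rather than only the previous one; the choice changes which rows count as control-of-variables trials.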


The data in this work comes from 635 students across three 
schools and five teachers. During this study, students par- 
ticipated in a pretest and posttest (each lasting one class 
period), as well as the 2-3 week long curriculum. During 
the curriculum, students worked in teams of 2-3. These 635 
students formed 255 teams. 


3. RESULTS 


I used pretest and posttest scores to understand the effect of 
actions with the simulation on learning. I then examined the
role that the number of rows of data a student generated using
the table scaffolding plays in learning. I found that the number
of rows generated in iteration 1 of the simulation is a signifi- 
cant predictor of individual posttest scores, when controlling 
for pretest scores and curriculum group (b = 0.10, t(546) = 
2.68, p < 0.01). Next, I examined the impact of controlling 
variables on learning. I found that the number of Control Of 
Variables (COV) Trials run, however, is not quite a signifi- 
cant predictor of posttest score, when controlling for group 
and pretest score (b = 0.06, t(546) = 1.63, p = 0.10). In 
addition, using a dummy variable for conducting any COV 
Trials does not significantly predict posttest scores when 
controlling for pretest scores and group (b = 0.005, t(546) 
= 0.13, p = 0.90). Together, these results indicate that the 
control of variables strategy, while a good practice in sci- 
ence, is not as helpful for developing an understanding of 
the scientific principles at play in a simulation. More ex- 
perimentation using the model is beneficial for developing a 
better understanding of the scientific concepts. 


I then split the students up based on their actions during the
simulation step (did not generate any rows in the table, generated
one row, or generated 2 or more rows). I found that
generating 2 or more rows in the table significantly predicts 
posttest scores, when controlling for pretest score and work- 
ing group (b = 0.12, t(546) = 3.11, p < 0.01), though gen- 
erating no rows or 1 row were not significant predictors. I 
also developed a variable, Percent Systematic, that is the 
percentage of the total rows a group generated that used 
the control of variables strategy. This variable has the abil- 
ity to show more nuance in how students were employing 
the control of variables strategy, but was also not predictive 
in determining posttest scores, when controlling for pretest 
and group id (b = 0.05, t(508) = 1.32, p = 0.188). 


There were also two short response scaffolding questions on 
the same step as the interactive simulation. I generated a 
variable based on the number of questions students answered 
(0, 1, or 2). This was predictive of posttest score, when 
controlling for pretest score and group id (b = 0.10, t(546) 
= 2.56, p = 0.011). 


Overall, evidence suggests that, to improve learning outcomes,
students should be encouraged to experiment with the model, guided
to produce at least two rows of data in the table, and prompted
to use the short response questions. Perhaps
changing more than one variable at a time in this type of 
environment indicates that students are spending more time 
thinking about possible outcomes. I have further examined 
this data using k-means clustering algorithms. 


4. FURTHER QUESTIONS 


I have finished the majority of data collection for my disser- 
tation. I will conduct one more study during the spring of 
2017, and there will be the potential for a follow-up study 
later. This is an important time for me to get feedback on 
my work, especially on the analysis of the action log data I 
have collected from over a thousand students. I will begin 
the writing phase of my dissertation work during the sum- 
mer, and expect to complete my dissertation within the next 
12 months. 


During the doctoral consortium, I would like to discuss the 
following: 


• How to assess patterns in student actions in interactive
simulations (tools and packages for doing this, and
assessment of what makes a pattern meaningful)


• Designing studies that integrate education theory and
data mining


• Assessment of inquiry skills in online environments


• Use of event logs in online curriculum to assess student
use of curriculum, and how this can be used to assess
learning in tandem with other methods


ABSTRACT


The present study aimed to propose a Chinese automated essay
scoring model to assess college students’ writing quality. Thirty-one
Chinese linguistic indicators were developed based on
Coh-Metrix indices and characteristics of Chinese texts. Essays
collected from 277 college students were analyzed using an
automated Chinese text analysis tool. A stepwise regression was
used to explain the variance in human scores. The number of
words, number of low strokes, content word frequency, minimal
edit distance (all words), and minimum frequency for content
words explained 55.8% of the variance in human scores. In
addition, a discriminant analysis showed that seven indicators
were predictive of human essay ratings: number of words, content
word frequency, concreteness, Measure of Textual Lexical Diversity,
minimal edit distance (part of speech), minimal edit distance (all
words), and words per sentence. The present study further
explored the effectiveness of the Chinese automated essay scoring
model by using three different methods: stepwise linear regression,
discriminant analysis, and Nonparametric Weighted Feature
Extraction (NWFE) classification. The preliminary results showed
that the NWFE classification method produced more exact matches
(51.3%) between the predicted essay scores and the human scores
than stepwise regression (47.3%) and discriminant analysis
(47.3%).


Keywords 


Chinese automated essay scoring, writing quality, NWFE 
classification, Chinese linguistic indicators 


1. INTRODUCTION 


Essay scoring has traditionally relied on expert raters. Such
scoring is time-consuming and requires a large amount of
human effort. Because of these limitations, automated essay
scoring has become an important research area in essay assessment.
According to past studies, automated essay scoring achieves
perfect agreement (i.e., an exact match between human and
computer scores) in 30-60% of cases and adjacent agreement (i.e.,
within 1 point of the human score) in 85-99% of cases [1]. Moreover,
the number of studies analyzing scored essays with Coh-Metrix
has increased noticeably in recent years [2, 4, 5, 6, 7, 8, 13, 14, 15].
Coh-Metrix is an automated text analysis tool that provides many
different linguistic indices [10]. The tool computes these
indices by combining lexicons, a syntactic parser, and several
other components that are widely used in computational
linguistics.


Because the characteristics of the Chinese language differ from
those of English, indices developed for English cannot be applied
directly to Chinese essay writing. Most expert raters consider the
following aspects: number of words, structural organization,
vocabulary diversity, typos, and punctuation. Building on the
development of Coh-Metrix, an automated text analysis tool was
developed for Chinese. A total of 66 Chinese linguistic
indicators were used to analyze the characteristics of Chinese
texts [12].


The writing literacy assessment is an important standardized
test of college students’ writing skills in Taiwan. The
assessment examines whether students can express personal
comments on specific issues. Students read an article and then
express personal comments by writing an essay of two hundred
words. These essays are scored by two experts on a scale from
0 to 5. However, this scoring requires many experts and
considerable time. A suitable automated scoring model is
therefore needed.


2. PROPOSED CONTRIBUTIONS 


The purpose of the study is to explore the characteristics of
Chinese writing and to propose a suitable Chinese automated essay
scoring model to assess college students’ writing quality. Past
studies have used regression analysis to explore how much of the
variance in human scores is predicted by different text features.
Moreover, they proposed automated essay scoring models and
examined score matches using linear regression and discriminant
analysis. In the present study, a Nonparametric Weighted Feature
Extraction (NWFE) classification method was also used to examine
score matches.


Nonparametric Weighted Feature Extraction (NWFE) is based on
a nonparametric extension of scatter matrices. It can reduce
the dimensionality of the feature space and increase classification
accuracy [11]. The present study passed the variables selected
stepwise by linear regression analysis and discriminant analysis
to the NWFE classification method and examined the accuracy of
score matches.


3. Method 


3.1 Text Indices Selection Procedure 
The present study collected Chinese essays from college students
in Taiwan. All essays were analyzed with the Chinese automated text
analysis tool. The tool provides 62 Chinese linguistic indices,
including basic text measures (e.g., text and sentence length), word
information (e.g., word frequency, concreteness), cohesion
(semantic and lexical overlap, lexical diversity, and the
incidence of connectives), part-of-speech and phrase tags (e.g.,
nouns, verbs, adjectives), and syntactic complexity (e.g.,
sentence syntax similarity, minimal edit distance).


First, correlation analyses were conducted to examine the
strength of the relations between the selected indices and the human
scores of essay quality. Text indices were retained if they
correlated significantly with human scores. Multicollinearity was
then assessed between the indices (r > .900). When two or more
indices demonstrated multicollinearity, the index that correlated
most strongly with human scores was retained. In total, thirty-one
indices were used in the study.
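
The multicollinearity step above can be sketched in Python as follows. This is a hedged illustration rather than the procedure actually implemented for the study: the significance test on each correlation is omitted, the greedy ordering is an assumption, and the index names and values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def drop_collinear(indices, scores, threshold=0.900):
    """Keep one index from every group of highly correlated indices.

    Indices are visited from strongest to weakest absolute correlation
    with the human scores; an index is kept only if its absolute
    correlation with every index already kept is at most `threshold`.
    """
    ranked = sorted(
        indices,
        key=lambda name: abs(pearson(indices[name], scores)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(
            abs(pearson(indices[name], indices[other])) <= threshold
            for other in kept
        ):
            kept.append(name)
    return kept


# Made-up values for five essays; n_chars is nearly collinear with n_words.
scores = [1, 2, 3, 4, 5]
indices = {
    "n_words": [100, 150, 210, 240, 300],
    "n_chars": [205, 298, 424, 478, 601],
    "concreteness": [3, 1, 4, 2, 5],
}
retained = drop_collinear(indices, scores)
```

In this toy example `n_chars` is dropped because it correlates above .900 with `n_words`, which has the stronger correlation with the scores.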


3.2 Essay Scoring 

277 essays were collected from college students in Taiwan. Each
essay was scored independently by two expert raters using a
rating scale with a minimum score of 0 and a maximum score of 5
to assess essay quality. The experts evaluated the essays based on a
standardized rubric used in the Chinese writing literacy
assessment in Taiwan. The correlation between the two
experts’ scores was 0.788, indicating consistent expert scoring.


3.3 Essay Evaluation

Three different methods were used to examine the accuracy of
automated essay scoring: linear regression analysis, discriminant
analysis, and NWFE classification. Text features were selected by
linear regression and discriminant analysis. The leave-one-out
method was used to form the training and testing essay sets: each
essay was held out in turn and scored by a model trained on the
remaining essays. The present study compared the exact matches
obtained with the three methods.
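
A minimal sketch of leave-one-out evaluation, assuming a single predictor (word count) and an ordinary least-squares line, is given below. The study used several stepwise-selected features; the one-feature model, the function names, and the data here are illustrative assumptions only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out_predictions(xs, ys, low=0, high=5):
    """Predict each essay's score from a model trained on all other essays.

    Predictions are rounded to the nearest integer and clipped to the
    0-5 score range. Note that Python's round() rounds exact halves to
    the nearest even integer.
    """
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        raw = slope * xs[i] + intercept
        predictions.append(min(high, max(low, round(raw))))
    return predictions


# Made-up, perfectly linear data: every held-out score is recovered.
word_counts = [100, 150, 200, 250, 300, 350]
human_scores = [0, 1, 2, 3, 4, 5]
predicted = leave_one_out_predictions(word_counts, human_scores)
```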


4. Preliminary Results 


4.1 Linear Regression Analysis: Text Features 
A stepwise regression analysis was conducted to examine which
text indicators were predictive of human essay ratings. 40 Chinese
text features were used in the study. The results are presented in
Table 1. Five indicators were significant predictors in the regression
model: number of words, number of low strokes, content
word frequency, minimal edit distance (all words), and
minimum frequency of content words, F = 12.074, p < .001, r
= .747, r² = .558. The linear regression results
demonstrate that the five variables account for 55.8% of the
variance in the human scoring of writing quality.


Table 1. Stepwise regression results for text features 


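[Table 1 body garbled in extraction. Recoverable row labels: content word frequency; minimal edit distance (all words); minimum frequency for content words. Recoverable values: 824, 402, 086.]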


4.2 Discriminant Analysis: Text Features 

The purpose of the discriminant analysis was to examine whether
text features are predictive of human scoring. The results of the
discriminant analysis showed that seven text features could
predict human scoring: number of words, content
word frequency, concreteness, Measure of Textual Lexical
Diversity (MTLD), minimal edit distance (part of speech), minimal
edit distance (all words), and words per sentence.


4.3 Exact and Adjacent Matches 


Table 2 and Table 3 present the results for exact and adjacent
matches. The stepwise linear regression analysis selected the
following features: number of words, number of low strokes, content
word frequency, minimal edit distance (all words), and minimum
frequency for content words. Under leave-one-out evaluation, the
predicted essay scores (rounded to 0-5) matched the
human scores with 47.3% exact accuracy and 95.3% adjacent
accuracy.
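
For reference, exact and adjacent accuracy can be computed as in the short Python sketch below. It assumes integer scores and counts exact matches as adjacent as well (adjacent meaning within 1 point of the human score); the function name and the example scores are hypothetical.

```python
def agreement(predicted, human):
    """Return (exact, adjacent) agreement as proportions of all essays.

    Exact: predicted score equals the human score.
    Adjacent: predicted score is within 1 point of the human score,
    so every exact match also counts as adjacent.
    """
    n = len(human)
    exact = sum(p == h for p, h in zip(predicted, human))
    adjacent = sum(abs(p - h) <= 1 for p, h in zip(predicted, human))
    return exact / n, adjacent / n


# Made-up scores for six essays on the 0-5 scale.
predicted_scores = [3, 2, 4, 5, 1, 0]
human_scores = [3, 3, 4, 2, 1, 1]
exact_rate, adjacent_rate = agreement(predicted_scores, human_scores)
```

Here three of the six predictions match exactly and five are within one point.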


The stepwise discriminant analysis selected the following features:
number of words, word frequency of content words, minimal edit
distance (local), MTLD, number of terms, concreteness, and
minimal edit distance (part of speech). Under leave-one-out
evaluation, the predicted essay scores matched the human
scores with 47.3% exact accuracy and 93.9% adjacent accuracy.


The present study applied the NWFE classification method to
examine the effectiveness of automated essay scoring. With the
text features selected by linear regression, 48.7% of the predicted
scores exactly matched the human scores. With the text features
selected by discriminant analysis, 51.3% of the predicted scores
exactly matched the human scores.


Table 2. Comparison of exact matches


Classification       Text features selected    Text features selected
method               by linear regression      by discriminant analysis

Linear regression    47.3%                     46.6%
NWFE                 48.7%                     51.3%


Table 3. Comparison of adjacent matches


Columns: classification method; text features selected by linear regression; text features selected by discriminant analysis.

Linear regression    95.3%
NWFE                 89.9%


5. Conclusion


Past studies have found that the number of words is an
important indicator of human scores [4, 15]. The present
results likewise showed that the number of words has a
strong, significant correlation with human scores. Three
indicators, the number of words, the minimal edit distance (local),
and the number of low strokes, belong to the Descriptive and
Syntactic Complexity categories in Coh-Metrix. MTLD belongs to
Lexical Diversity. These indicators are related to the scoring
guide for writing for college students in Taiwan.


Comparing exact matches under leave-one-out evaluation, stepwise
linear regression analysis and stepwise discriminant analysis
produced consistent results. Moreover, regardless of whether the
indicators were selected by stepwise linear regression or by
stepwise discriminant analysis, the exact-match accuracy of the
NWFE method was higher than that of the other two methods.


6. Future Works 


Past studies have investigated the potential for component scores 
that are calculated using the linguistic features by Coh-Metrix in 
assessing text readability [9, 12]. Moreover, one study has 
explored correlations between human ratings of essay quality and 
component scores based on similar natural language processing 
indices and weighted through a principal component analysis [2]. 
However, this approach has not been extended to computational
assessments of essay quality in Chinese. The present study will
adopt an approach similar to that of past studies [9, 12]. We will
conduct a principal component analysis (PCA) or factor analysis
to reduce the number of indices selected from the Chinese automated
text analysis tool into a smaller number of components comprised
of related features. The present study will further explore the
correlation between component scores and human scoring. A 
Chinese automated essay scoring model based on text component 
scores will be developed and explored. 


ABSTRACT


In this age of fake news and alternative facts, the need for a 
citizenry capable of critical thinking has never been greater. 
While teaching critical thinking skills in the classroom re- 
mains an enduring challenge, research on an ill-defined do- 
main like critical thinking in the educational technology 
space is even more scarce. We propose a difficulty factors 
assessment (DFA) to explore two factors that may make 
learning to identify fallacies more difficult: type of instruc- 
tion and belief bias. This study will allow us to make two 
key contributions. First, we will better understand the rela- 
tionship between sense-making and induction when learning 
to identify informal fallacies. Second, we will contribute to 
the limited work examining the impact of belief bias on in- 
formal (rather than formal) reasoning. We discuss how the 
results of this DFA will also be used to improve the next 
iteration of our fallacy tutor, how this tutor may ultimately 
contribute to a computational model of informal fallacies, 
and some potential applications of such a model. 


Keywords 

Cognitive Tutors, Informal Logical Fallacies, Informal Rea- 
soning, Cognitive Task Analysis, Difficulty Factors Assess- 
ment 


1. INTRODUCTION 


Despite the recognized importance of critical thinking in traditional
education, critical thinking is largely absent from
the educational technology space (e.g., online courses/MOOCs,
cognitive tutoring systems, etc.). Some of the recent work
on critical thinking in educational technology has focused 
on comparing critical thinking in face-to-face and computer- 
mediated interactions. Researchers often use content-analysis 
to identify instances of critical thinking in online and face- 
to-face discussions [3, 10]. In this work, critical thinking is 
not the primary focus of the course, but rather an epiphe- 
nomenon. 


Other work, particularly in the domains of philosophy, writing
and law, has addressed critical thinking more directly.
For example, some recent work has demonstrated that ar- 
gument diagramming using a graphical interface improved 
argumentative writing skills [6] as well as critical thinking 
skills more generally [5]. However, similar gains are seen 
using paper-and-pencil argument diagramming as well, sug- 
gesting the software may be more of a convenience than a 
necessary factor [4]. 


Despite the challenges of working in an ill-defined domain 
[8], another intersection of critical thinking and e-learning 
has been in intelligent tutoring systems (ITS). For example, 
Ashley and Aleven [1] built an ITS to teach law students 
to argue with cases more effectively. The study we propose 
extends this work on critical thinking in the ITS space to 
a more general population. We will build a cognitive tutor 
that teaches users to identify several common informal log- 
ical fallacies. We chose informal fallacies because they offer 
a degree of structure to the otherwise ill-defined domain of 
informal reasoning, making the content more amenable for 
use in a cognitive tutor. Using this tutor, we will conduct a 
difficulty factors assessment (a type of a cognitive task anal- 
ysis) [7] to evaluate the impact of two factors on the user’s 
ability to identify logical fallacies. 


The first factor explored will be type of instruction. The 
Knowledge-Learning-Instruction (KLI) framework lists three 
types of learning processes, and suggests that the best in- 
struction for teaching a specific skill depends on the type of 
process used to learn that skill. The purpose of the type of 
instruction manipulation is to better understand the learn- 
ing processes that underpin the identification of logical fal- 
lacies. Specifically, we are interested in whether this skill is 
more efficiently learned using induction (e.g., showing many 
examples of the fallacy) or sense-making (e.g., providing de- 
tailed descriptions of the fallacy’s mechanics). Textbooks 
used to teach logical fallacies often take both approaches, 
giving readers an explanation of a fallacy followed by some 
small number of examples. As this skill may consist of mul- 
tiple, more fundamental skills (or knowledge components), 
the mixed approach used by textbooks may prove to be the 
most efficient. Nevertheless, the proportion of time to de- 
vote to each learning process remains an open question that 
this experiment may help answer. 


The second factor that may negatively impact a student’s
ability to identify logical fallacies is belief bias, the tendency
to judge arguments more favorably if we agree with the conclusion.
Early work on belief bias explored its effect on formal
reasoning using syllogisms [9, 2], but there is some evidence
that suggests that belief bias may operate differently
in informal reasoning [11]. The proposed study builds on
and contributes to this research by empirically testing the
effect of belief bias on learning to identify informal fallacies.


Table 1: Breakdown of the problems used in the tutor. Note that (F), (A), (C), and (L) correspond to for,
against, conservative and liberal, respectively. For example, in the first cell of the table, we see an apolitical
prompt, which fallacy 1 is used to argue for.

            Apolitical   Political
Fallacy 1   (F)          (C)          (
Fallacy 2   (A)          (L)          (
Fallacy 3   (F)          (C)          (
Fallacy 4   (A)          (L)          (
Fallacy 5   (F)          (C)          (
Fallacy 6   (A)          (L)          (


2. FUTURE RESEARCH PLANS 


2.1 Difficulty Factors Assessment 

We will use a Difficulty Factors Assessment (DFA) to iden- 
tify the factors (if any) that make it more or less difficult 
for students to learn how to identify logical fallacies. The 
proposed experiment will explore the impact of two primary 
factors as well as several secondary factors. 


2.1.1 Type of Instruction 

The proposed experiment will explore the impact of type of 
instruction by randomly assigning each participant to one 
of three conditions. In each condition, when the participant 
is given a problem and asked to identify the logical fallacy, 
they will be given a set of possible answers and the option 
to view more information about each of the answers. In 
the first condition, when participants ask for more informa- 
tion they will be shown a brief, but detailed description of 
the mechanics of each fallacy (sense-making). In the second 
condition, participants will be shown two examples of each 
fallacy (induction). In the third condition, participants
will be shown a description and one example for each fallacy 
(mixed). 


In addition to comparing the effect of increased examples 
between groups, we will be able to compare this effect within 
groups by treating completed problems as viewed examples. 
This analysis will help us pinpoint the average number of 
examples needed to be able to identify the fallacies used in 
the experiment, and compare that number to the average 
numbers seen in common textbooks. 


2.1.2 Belief Bias 


The proposed experiment will explore the impact of belief 
bias on a student’s ability to identify logical fallacies by al- 
tering the political orientation of problem content and com- 
paring performance on those problems with the participant’s 
personal political orientation. Of the 36 problems presented, 
half will be apolitical (i.e., politically neutral) and half will
be political. Of the political problems, half will have a conservative
orientation, half a liberal orientation. The apolitical
problems are also split into two categories (for an issue or
against an issue) for balance. Problems can be broken down 
into three subcomponents: the prompt (either political or 
apolitical), the fallacy, and the conclusion (either for/against 
or conservative/liberal). Table 1 shows the breakdown of 
each problem. 


2.1.3 Secondary Factors Explored 

In addition to the main effects of type of instruction and 
belief bias, our design also allows us to explore several sec- 
ondary factors. We can test whether type of instruction has 
a differential effect on specific fallacies. For example, sense- 
making may be more important for learning to identify a cir- 
cular argument, while examples may be sufficient for learn- 
ing to identify a Post Hoc fallacy. We can also test whether 
participants are more likely to identify a fallacy given the 
nature of the prompt (political vs. apolitical) or the valence 
of the conclusion (for/against or conservative/liberal). 


2.2 Towards a Computational Model of Logical Fallacies

The ultimate goal of this work is to develop a computational 
model of logical fallacies. Achieving this goal requires over- 
coming several large challenges. 


2.2.1 Lack of Labeled Examples 


First, to train a model to detect such a nuanced use of lan- 
guage will most likely require a large number of labeled ex- 
amples. Furthermore, these examples will most likely have 
to be varied and authentic (perhaps unlike many of the pur- 
posefully illustrative examples used in textbooks). To solve 
this shortage of labeled examples, we propose using our cog- 
nitive tutor to train crowd workers to identify fallacies in 
real-world media sources. The quality of those labels can 
be evaluated using traditional crowdsourcing methods (e.g., 
consensus of the crowd). High quality labels can then be 
automatically integrated into the tutor training system, in- 
creasing the number of potential examples crowd workers 
can use to achieve mastery. This increase in the number of 
examples may be especially important if our DFA reveals 
that learning to identify informal fallacies is a primarily in- 
ductive skill. Figure 1 shows the feedback loop relationship 
between crowd workers and the cognitive tutor. 
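
One simple form of the crowd consensus mentioned above is a majority vote with a minimum agreement threshold, sketched here in Python. The threshold value, function name, and labels are illustrative assumptions; we have not committed to a particular aggregation method.

```python
from collections import Counter


def consensus_label(labels, min_agreement=0.6):
    """Return the majority label, or None if agreement is too low.

    `min_agreement` is the share of workers that must choose the most
    common label for it to be accepted (an assumed, tunable threshold).
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    if count / len(labels) >= min_agreement:
        return label
    return None


# Made-up worker labels for two passages.
accepted = consensus_label(["slippery slope", "slippery slope", "post hoc"])
rejected = consensus_label(["slippery slope", "post hoc"])
```

Only accepted labels would be candidates for integration into the tutor; passages without consensus could be sent to additional workers.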

Figure 1: Feedback loop relationship between the
cognitive tutor and crowd workers. The real-world 
examples labeled by crowd workers can be used to 
both improve the cognitive tutor and train compu- 
tational models. 


2.2.2 Modeling the Semantic Nature of Fallacies 
Informal Logical Fallacies is an umbrella term that encom- 
passes a diverse array of fallacies. Some of these fallacies 
may be easier for a machine learning model to detect. For 
example, the Slippery Slope fallacy often has the generic 
structure: “First X, pretty soon there’ll be Y too!” These 
kinds of syntactic features will likely be easier to detect than 
the semantic features necessary to identify a fallacy like Cir- 
cular Reasoning. Finding the right method for approaching 
these more difficult cases will be one of the key challenges 
of this work. 
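
To illustrate what a purely syntactic cue might look like, the following Python sketch flags sentences that loosely follow the Slippery Slope template quoted above. The pattern and the list of cue phrases are hypothetical and far too crude for real text; they are shown only to contrast surface features with the semantic features a fallacy like Circular Reasoning would require.

```python
import re

# Hypothetical surface pattern: the word "first", followed later in the
# same sentence by a cue phrase signalling an inevitable next step.
SLIPPERY_SLOPE = re.compile(
    r"\bfirst\b.+?\b(pretty soon|next thing you know|before long)\b",
    re.IGNORECASE | re.DOTALL,
)


def looks_like_slippery_slope(sentence):
    """Return True if the sentence matches the surface pattern."""
    return SLIPPERY_SLOPE.search(sentence) is not None


flagged = looks_like_slippery_slope(
    "First they ban soda, pretty soon there'll be no dessert too!"
)
not_flagged = looks_like_slippery_slope("The first chapter covers syllogisms.")
```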


2.2.3 Potential Applications 

If we meet these challenges and are able to detect logical 
fallacies in real-world text, there are potential applications 
in media (both traditional and social), politics, and educa- 
tion. One could imagine a plugin for your favorite word 
processor that underlines an Appeal to Ignorance just as it 
would a misspelled word. Similarly, one could imagine how 
broadcasts of presidential debates in the future might be ac- 
companied by a subtle notification anytime a candidate uses 
Moral Equivalence. 


In conclusion, we propose a plan to develop a computational 
model of informal logical fallacies. The first, and most con- 
crete, step of this process is developing a better understand- 
ing of the factors that promote and hinder how we learn to 
identify informal fallacies. We propose a difficulty factors 
assessment to explore the impact of sense-making versus induction
support, as well as the impact of belief bias. Discovering
how these factors regulate learning will not only allow us
to build a better tutor, but will improve our understanding 
of how we learn informal logical fallacies in general. 


ABSTRACT


Recent mandates by federal funding agencies and universi- 
ties to create open access repositories of published research 
allow researchers a wealth of texts to analyze. Furthermore, 
some publishers of academic texts have begun creating poli- 
cies to permit non-commercial text mining of journal arti- 
cles. This project follows the approach of [7], which auto- 
matically extracts result sentences from full-text biomedical 
journal articles by using support vector machines and naive 
Bayes classifiers. I also experiment with using the least absolute
shrinkage and selection operator (LASSO) [6, 18] as a
method to select features for the classifiers. I compare this 
new approach with other feature selection strategies used in 
previous studies. 


Keywords 


Information extraction, text classification, feature selection 


1. INTRODUCTION 


Information overload is hardly a new concept, with even 
the Ancient Roman scholar Seneca the Elder claiming in 1 
AD, “the abundance of books is distraction” [8]. Similarly, 
the automatic summarization of text has been researched 
since at least the 1950s, with Luhn’s work on creating abstracts
automatically [11]. In concert, United States (US)
federal funding agencies, such as the National Institutes of
Health (NIH) [13], the National Science Foundation (NSF)
[14], and the Institute of Education Sciences (IES) [9],
and university systems such as the University of California 
(UC) [1] have adopted open access policies for funded and 
published research. Publishers of academic journals, such as 
Elsevier [4] and Springer [15], have adopted policies for non- 
commercial research of texts. Finally, some national govern- 
ments (e.g., the United Kingdom (UK) [10]) have adopted 
changes to copyright law allowing for non-commercial re- 
search of copyright protected works. 


Given these open-access and legal policy changes, a wide
swath of researchers now has access to a wealth of texts
to analyze automatically. Specifically, the shifts in policies
and laws allow for text mining to extract result sentences
from full-text journal articles. Further, publishers have cre- 
ated APIs which allow for access to texts. It is unlikely that 
future researchers will be able to carefully read and analyze 
all of the texts in order to extract pertinent results. How- 
ever, open-access policies in the US by the NIH have enabled 
automated extraction since the late 2000s in some fields. 


My research seeks first to extend the work done in the 
biomedical sciences, particularly in [7], to the educational 
sciences, and second to explore an additional feature selection 
technique. This experiment complements the work in 
[20] by using the LASSO as a feature selection technique. 


2. BACKGROUND 


Text mining has been recognized as a tool to reduce the 
time required to complete a systematic literature review [17]. 
There are several tasks text mining can simplify when cre- 
ating a systematic review. Current text mining approaches 
allow relevant studies to be identified by identifying relevant 
search terms, and allow the characteristics of prior in- 
vestigations to be described through automatic summariza- 
tion [17]. This proposal is inspired by the systematic search 
of literature using targeted queries by the information scien- 
tist Don Swanson, who revealed a link between magnesium 
and migraines in the late 1980s [16]. This finding was novel 
because it linked medical literature with chemistry litera- 
ture. Thus, I want to uncover previously unrealized links, 
contradictions, and confirmations in the current literature 
on how students utilize computers to enhance or hinder 
their educational experience. 


Supervised learning using text has been heavily researched in 
the biomedical sciences. For example, [12] proposed to use a 
modified naive Bayes classifier which can determine whether 
an abstract is relevant for a given topic, based on the words 
in previously seen abstracts. They also propose a unique 
weighting scheme which allows for high recall and reason- 
able precision. In their work, they show their proposed pro- 
cess can significantly reduce the time required to conduct a 
systematic literature review. Given the number of publica- 
tions available following from the aforementioned changes, 
these results could help educational researchers significantly 
reduce the time needed to determine which previously published 
work is most relevant. 


More broadly, this work addresses the need to have a “liv- 
ing systematic literature review” where the most up-to-date 
published findings can be included for practitioners and re- 
searchers to implement and be informed of these findings 
[3]. One study found that the time between a published 
finding and its inclusion in a systematic literature review av- 
erages between 2.5 and 6.5 years [3]. This relates directly to 
an initiative by the US’s Institute of Educational Sciences 
to use evidence-based practices [19]; that is, connecting the 
knowledge gained from research to its use in practice. 


3. APPROACH 


This project will extract sentences containing results from 
full-text journal articles in peer-reviewed journals. Given 
that journals have dozens of volumes and issues, it is likely 
not feasible to read and find all relevant articles needed to 
understand prior research. This process will create a sys- 
tematic review of literature from educational journals in a 
targeted area: student interaction and behavior in comput- 
ing environments. The systematic review will inform re- 
searchers on previous findings and update practitioners on 
the most current research. 


3.1 Extracting Results 

To extract result sentences, I will parse full-text journal arti- 
cles into sentences, using a tokenizer, for example, Python’s 
NLTK [2]. Next, I label the sentences as either containing 
a result or not, as well as indicate the section of the ar- 
ticle where the sentence lies, and whether the sentence is 
the first or last in the respective paragraph, following from 
[7]. In [7], result sentences were distributed throughout the 
journal articles and were most common in the first or last 
sentence of the paragraph. Then, I will experiment with var- 
ious classifiers, such as support vector machines, naive Bayes 
classifiers, decision trees, and various ensemble models. The 
output of the classifiers will be the sentences containing re- 
sults, which can then be used to form a thorough systematic 
review. 
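
The labeling setup described above can be sketched roughly as follows. This is only an illustration under stated assumptions: a naive regular-expression splitter stands in for a real tokenizer such as NLTK's, and the function names, record fields, and sample article are hypothetical rather than part of the proposed system.

```python
import re


def split_sentences(paragraph):
    """Naive stand-in for a real tokenizer: split after ., ! or ? plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


def sentence_records(sections, splitter=split_sentences):
    """Turn {section name: [paragraph, ...]} into one record per sentence.

    Each record keeps the surface-level features mentioned in the text:
    the section of the article and whether the sentence is the first or
    last sentence of its paragraph.
    """
    records = []
    for section, paragraphs in sections.items():
        for paragraph in paragraphs:
            sentences = splitter(paragraph)
            for index, sentence in enumerate(sentences):
                records.append({
                    "text": sentence,
                    "section": section,
                    "is_first": index == 0,
                    "is_last": index == len(sentences) - 1,
                    "label": None,  # to be filled in by a human annotator
                })
    return records


# Hypothetical one-paragraph article used only to exercise the sketch.
article = {
    "Results": [
        "Students improved by 12 points. The effect was significant. "
        "We discuss this below."
    ],
}
records = sentence_records(article)
for record in records:
    print(record["section"], record["is_first"], record["is_last"], record["text"])
```

A different tokenizer can be swapped in through the `splitter` argument, since it only needs to map a paragraph string to a list of sentence strings.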


To train these models, I will select features using traditional 
metrics, such as information gain, mutual information, and 
the χ² statistic [20], which are the ones used by [7]. Interest- 
ingly, with these three feature selection strategies, no single 
term was selected by all three methods; however, the terms 
selected by the χ² statistic and information gain overlapped, 
as did those selected by information gain and mutual information. Because of 
this finding, I propose to use a different feature selection 
technique to select words or surface-level knowledge (e.g., 
sentence position, section of paper) to train these classifiers. 
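
To make two of these metrics concrete, the sketch below computes the χ² statistic and the (pointwise) mutual information of a term with the "result" class from a 2x2 contingency table, following the standard definitions for text categorization summarized in [20]. The toy sentences, labels, and function names are hypothetical, and information gain is omitted for brevity.

```python
import math


def contingency(sentences, labels, term):
    """Count a, b, c, d for one term.

    a: result sentences containing the term
    b: non-result sentences containing the term
    c: result sentences without the term
    d: non-result sentences without the term
    """
    a = b = c = d = 0
    for words, is_result in zip(sentences, labels):
        has_term = term in words
        if has_term and is_result:
            a += 1
        elif has_term:
            b += 1
        elif is_result:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Chi-square statistic of term/class dependence for a 2x2 table."""
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


def mutual_information(a, b, c, d):
    """Pointwise mutual information estimate log(a * n / ((a + c) * (a + b)))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")
    return math.log(a * n / ((a + c) * (a + b)))


# Toy corpus: each sentence is a set of words, each label says "contains a result".
sentences = [
    {"we", "found", "significant"},
    {"results", "significant"},
    {"we", "review", "literature"},
    {"prior", "work"},
]
labels = [True, True, False, False]

for term in ("significant", "we"):
    table = contingency(sentences, labels, term)
    print(term, chi_square(*table), mutual_information(*table))
```

In this toy corpus "significant" occurs only in result sentences and scores highly on both metrics, while "we" is evenly spread across the classes and scores zero on both.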


3.2 Feature Selection 

Another experiment I plan to conduct to extract words from 
the corpus of sentences from the journal articles is to uti- 
lize the LASSO to select words to use to train classifiers 
to discern sentences containing results from those that do 
not. Given that the LASSO is used for high dimensional 
data sets as a variable selection technique, in fields such as 
gene-expression analysis [5], this approach seems reasonable 
given the high dimensionality and sparseness of text data. 
I will experiment with various parameters of the LASSO 
to ensure reasonable feature selection; that is, a feature set 
that is not so small that it precludes high recall and 
reasonable precision, yet not so large that it hinders 
generalizability. 


The specific binomial logistic LASSO model I will use to 
select terms is 

\[
\log \frac{P(\text{result} = 1 \mid x)}{P(\text{result} = 0 \mid x)} = \beta_0 + x^{T} \beta, \qquad (1)
\]


where result equals one if the sentence x_i contains a result, 
and zero otherwise. Note that x is a matrix, where each row 
is a sentence, one column is result, and the other columns 
are words and surface-level features about the sentence. In 
the estimation phase, the model’s likelihood function is pe- 
nalized by a shrinkage parameter λ. This shrinkage param- 
eter shrinks unimportant βs towards zero, thus leaving only 
the most important terms with nonzero βs. These terms 
will then be used to train the classifiers to extract result sen- 
tences to be used in systematic literature reviews. Further, 
the magnitude of each β can be beneficial in determining 
the relative importance of a term. 


For this portion of the project, I will experiment with various 
values of λ to determine which give the best performance when train- 
ing the models to extract result sentences. A comparison of 
the feature selection strategies in [7, 20] will be conducted to 
determine any relationship between these feature selection 
strategies and the LASSO. 
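
As a minimal sketch of how the shrinkage parameter drives term selection, the code below fits an L1-penalized logistic model by proximal gradient descent (gradient step followed by soft-thresholding). This is an assumption made for illustration only: it is not the coordinate descent algorithm of [6], the intercept is left unpenalized, and the toy data, step size, and function names are hypothetical.

```python
import math


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam, step=0.5, iterations=2000):
    """Minimize mean logistic loss + lam * sum(|beta_j|) by proximal gradient.

    The step size is assumed small enough for this toy problem.
    """
    n, d = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * d
    for _ in range(iterations):
        # Residuals p_i - y_i under the current coefficients.
        residuals = []
        for row, target in zip(X, y):
            z = intercept + sum(b * v for b, v in zip(beta, row))
            residuals.append(1.0 / (1.0 + math.exp(-z)) - target)
        intercept -= step * sum(residuals) / n  # intercept is not penalized
        for j in range(d):
            gradient = sum(r * row[j] for r, row in zip(residuals, X)) / n
            beta[j] = soft_threshold(beta[j] - step * gradient, step * lam)
    return intercept, beta


# Toy design matrix: column 0 tracks the label, column 1 is unrelated to it.
X = [[1, 1], [1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1], [0, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

for lam in (0.1, 0.3):
    _, beta = fit_l1_logistic(X, y, lam)
    selected = [j for j, coefficient in enumerate(beta) if coefficient != 0.0]
    print(lam, selected)
```

On this toy data the smaller penalty keeps only the informative column, while the larger penalty shrinks every coefficient to zero, which is the trade-off the experiments with different values of λ are meant to explore.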


4. CURRENT STATUS 


My current task is to complete a literature review of text 
classification. In this literature review, I address traditional 
classifiers from multivariate statistics and machine learning, 
and also provide background on generating systematic 
literature reviews. The literature review also includes a dis- 
cussion of evidence-based practices and speculates on how a 
living systematic literature review might impact education 
research. 


A concurrent stage is procuring and processing texts for 
analysis. In [7], seventeen full-text articles were analyzed, 
with around 2550 total sentences being considered. Thus, 
once all texts have been selected, I will begin labeling the 
sentences as containing a result or not containing a result. 
Efforts are underway to procure a small research fund to pay 
a research assistant to also label sentences as a measure of 
inter-rater reliability. 
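
One common way to quantify agreement between two annotators is Cohen's kappa; the text does not name a specific statistic, so its use here is an assumption. The sketch below computes it for two hypothetical raters labeling ten sentences as result (1) or non-result (0).

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(category) / n) * (rater_b.count(category) / n)
        for category in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from the author and a research assistant.
author = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
assistant = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
kappa = cohens_kappa(author, assistant)
print(round(kappa, 3))
```

Here the raters agree on 8 of 10 sentences, and chance agreement is 0.52, so kappa is (0.80 - 0.52) / 0.48.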


5. PROPOSED CONTRIBUTIONS 


This work provides contributions to the fields of informa- 
tion science and educational data mining. One contribution 
is an alternative feature selection strategy which could im- 
prove performance of supervised learning methods. Because 
feature selection is arguably the most important analysis 
phase in text classification, using the LASSO in addition 
to strategies already used might help improve performance in 
text classification. 


Another contribution of the work is introducing the con- 
cept of a living systematic literature review to educational 
research. Given the explosion in the amount of published 
research in education, and the interest in applying evidence-based 
practice in education, this work can address 
both needs. 


6. ADVICE SOUGHT 


I would like advice on any or all of these concerns: 


1. Are there other approaches, besides classifiers such as 
support vector machines, naive Bayes, discriminant 
analysis, neural networks, and decision tree classifiers 
that would be useful for this approach? 


2. What suggestions do you have for analyzing the re- 
sult sentences once they have been discovered by the 
classification algorithms? 


3. Do you have any suggestions for experiments with the 
shrinkage parameter, λ, for selecting terms when using 
the LASSO? 


4. Are there any specific metrics you would suggest to 
use for analyzing the results of either result extraction 
or selecting terms? 


7. REFERENCES 

[1] Academic Senate of the University of California. UC 
systemwide academic senate open access policy, 2013. 

[2] S. Bird. NLTK: The natural language toolkit. In 
Proceedings of the COLING/ACL on Interactive 
presentation sessions, pages 69-72. Association for 
Computational Linguistics, 2006. 

[3] J. H. Elliott, T. Turner, O. Clavisi, J. Thomas, J. P. 
Higgins, C. Mavergames, and R. L. Gruen. Living 
systematic reviews: an emerging opportunity to 
narrow the evidence-practice gap. PLoS Med, 
11(2):e1001603, 2014. 

[4] Elsevier, Inc. Text and data mining policy, 2014. 

[5] J. Friedman, T. Hastie, and R. Tibshirani. The 
elements of statistical learning, volume 1. Springer, 
2001. 

[6] J. Friedman, T. Hastie, and R. Tibshirani. 
Regularization paths for generalized linear models via 
coordinate descent. Journal of statistical software, 
33(1):1, 2010. 

[7] H. A. Gabb, A. Lucic, and C. Blake. A method to 
automatically identify the results from journal articles. 
iConference 2015 Proceedings, 2015. 

[8] Hewlett Packard. Dizzying volumes of data is nothing 
new. 

[9] Institute of Educational Sciences. IES policy regarding 
public access to research, 2016. 

[10] Intellectual Property Office. Exceptions to copyright: 
Research, 2014. 

[11] H. P. Luhn. The automatic creation of literature 
abstracts. IBM Journal of research and development, 
2(2):159-165, 1958. 

[12] S. Matwin, A. Kouznetsov, D. Inkpen, O. Frunza, and 
P. O’Blenis. A new algorithm for reducing the 
workload of experts in performing systematic reviews. 
Journal of the American Medical Informatics 
Association, 17(4):446-453, 2010. 

[13] National Institutes of Health. Revised policy on 
enhancing public access to archived publications 
resulting from NIH-funded research, 2008. 


[14] National Science Foundation. NSF’s public access 
plan: Today’s data, tomorrow’s discoveries (NSF 
15-22), 2015. 

[15] Springer. Springer’s text- and data-mining policy, 
2016. 

[16] D. R. Swanson. Migraine and magnesium: Eleven 
neglected connections. Perspectives in Biology and 
Medicine, 31(4):526-557, 1988. 

[17] J. Thomas, J. McNaught, and S. Ananiadou. 
Applications of text mining within systematic reviews. 
Research Synthesis Methods, 2(1):1-14, 2011. 

[18] R. Tibshirani. Regression shrinkage and selection via 
the lasso. Journal of the Royal Statistical Society. 
Series B (Methodological), pages 267-288, 1996. 

[19] US Department of Education: Institute of Educational 
Sciences. Identifying and implementing educational 
practices supported by rigorous evidence: A user 
friendly guide, 2003. 

[20] Y. Yang and J. O. Pedersen. A comparative study on 
feature selection in text categorization. In ICML, 
volume 97, pages 412-420, 1997. 


Proceedings of the 10th International Conference on Educational Data Mining 438 


Intelligent Argument Grading System for 
Student-produced Argument Diagrams 


Linting Xue 
North Carolina State 
University 
Raleigh, North Carolina, USA 


Ixue8@ncsu.edu 


ABSTRACT 


Current automated essay grading systems are typically fo- 
cused on the semantic and syntactic analysis of written ar- 
guments via Natural Language Processing techniques. Few 
systems focus on the automatic assessment of argument struc- 
ture. In this work, we propose to build an Intelligent Argu- 
ment Grading System to automatically assess and provide 
feedback on the structure of arguments of student-produced 
argument diagrams, which are graphical representations for 
real-world argumentation. The proposed system contains 
two stages. In the first, it automatically induces empirically- 
valid graph rules for expert-graded argument diagrams. An 
assessment model is trained from the dataset of manually- 
graded argument diagrams with the feature of induced graph 
rules. In the second stage, the assessment model automati- 
cally grades and provides feedback by identifying both good 
features and structural flaws in students’ work. The signifi- 
cance of this work is that the proposed system can reduce 
the high cost of labor by automatically inducing empirically- 
valid rules, grading, and providing feedback on the structure 
of arguments for students. We anticipate that the automatic 
feedback can help students revise their structural plans ac- 
cordingly before they start to write essays, which will in turn 
lead them to produce higher-quality arguments. 


Keywords 
Argument Diagrams, Structure of Arguments, Automated 
Grading System, Automatic Feedback 


1. INTRODUCTION 


Argumentation is an essential skill in scientific domains in- 
cluding physics, engineering, and computer science, where 
students must articulate and justify testable hypotheses through 
argumentative reasoning. As a consequence, automated es- 
say grading systems have become particularly useful tools 
for argument assessment (e.g. [1, 3, 9]). Prior research 
has shown that automated assessment systems can be used 
to assess student-produced arguments correctly and cost- 
effectively. Current automated grading systems rely on ei- 
ther surface-level analysis of linguistic features within a block 
of text (as in [3]) or deeper Natural Language Processing 
(NLP) that utilizes machine learning techniques (as in [9, 
1]). These systems are typically designed to evaluate on the 
basis of readability (e.g. the number of prepositions and 
relative pronouns or the complexity of the sentence struc- 
ture), shallow semantic analysis (e.g. lexical semantics or 
analysis of the relationships among named entities), and syn- 
tactic analysis (e.g. grammatical analysis). Ultimately, these 
systems return scores or feedback on the content and 
quality of the students’ writing based on a predictive 
model that is trained on the dataset stored in the system. 


However, very few active systems are focused on automatic 
analysis of the rhetorical structure of arguments to address 
structural flaws. Argument structure refers to the organi- 
zation of the key components of argumentation (e.g. hy- 
potheses, citations, or claims), which can reveal how the 
students justify their research hypotheses by using relevant 
evidence to support or oppose conclusory statements. In 
real-life teaching, the students are encouraged to structure 
their argumentative essays before they start writing by for- 
mulating a research hypothesis based on the research ques- 
tion, listing relevant evidence and factual information, and 
identifying the logical relationships between them. Evalu- 
ating the draft structure of these arguments and identifying 
flaws can help students to revise their plans and to produce 
high-quality arguments in the future. It is possible for hu- 
man experts to grade draft arguments. However, that process 
is costly and time-consuming. 


In this work, we propose to build an Intelligent Argument 
Grading System that can automatically grade and provide 
feedback on the structure of students’ arguments. The sys- 
tem will be based upon LASAD [4], an online tool for ar- 
gument diagramming and collaboration. The input to the 
system will be a valid argument diagram; the output will be a 
grade and feedback pointing out the outstanding substruc- 
tures and structural flaws in the student’s work. 


2. BACKGROUND 


2.1 Argument Diagrams 

Argument diagrams are visual representations of real-world 
argumentation that reify the essential components of argu- 
ments such as hypothesis statements, claims, and citations 
as nodes and the supporting, opposing, and clarification re- 
lationships as arcs [6]. These complex nodes and arcs can 
include text fields describing the node and arc types or free- 
text assertions, links to external resources, and other data. 
Argument diagrams have been used in a variety of domains, 
including science [10], law [8], and philosophy [2], to help stu- 
dents learn written argumentation. Prior researchers have 
shown that argument diagrams can be used to scaffold stu- 
dents’ understanding of existing arguments [2] and can help 
to support scientific reasoning [10]. 


Figure 1: A student-produced Argument Diagram. 


A sample student-produced diagram is shown in Figure 1. 
The diagram includes a hypothesis node at the bottom right, 
which contains two text fields, one for a conditional or if 
field, and the other for a consequent or then field. Two ci- 
tations are connected to the hypothesis node via supporting 
and opposing arcs colored green and red, respectively. They 
are also connected via a comparing arc. Each citation con- 
tains two fields: one for the citation information and the 
other for a summary of the work; each arc has a single text 
field explaining what purpose the relationship serves. 


3. PRELIMINARY RESULTS 


In Lynch’s study of diagnosticity of argument diagrams [5], 
a set of 104 paired diagrams and essays were collected at 
the University of Pittsburgh in a course on Psychological 
Research Methods. The diagrams and essays were indepen- 
dently graded by an experienced TA according to a paral- 
lel grading rubric. The study showed that hand-authored graph 
rules were empirically valid and were correlated with the di- 
agram and essay grades, and thus that they could be used 
as the basis of predictive models for automatic grading. 


Our prior work has also shown that Evolutionary Computa- 
tion (EC) can be used to automatically induce empirically- 
valid graph rules for student-produced argument diagrams, 
and that the induced graph rules can be used as features for 
automatic grading [11, 12]. It is possible to harvest a set 
of diverse rules that are filtered via post-hoc Chi-Squared 
analysis [7]. This set includes both good rules, which are positively 
correlated with the diagram and essay grades, and bad rules, 
which are negatively correlated with them; the former represent 
positive structural features and the latter indicate flaws in 
the argument. 


Figure 2 shows an example of a positive graph rule (P-G) 
and a negative graph rule (N-G) induced in our prior work. 
P-G shows a graph structure where the students identified 
at least two related citations (c0 & c1) that can be synthe- 
sized to support a single claim (k0) and where they included 
both a separate hypothesis (h) and an additional claim (k1). 
Figure 2: Examples of positive and negative graph rules. 
(Node and arc types in the figure: k = claim, h = hypothesis, 
c = citation, s = supporting arc, u = unspecified arc.) 


It shows one of the structures that students have been en- 
couraged to incorporate into their arguments as it shows an 
ability to synthesize citations to form a complex claim. 


N-G is a negative rule that contains a single claim node (k) 
which is connected to a citation node (c) via an unspecified arc 
(u), and a separate hypothesis node (h) which may or may 
not be connected to the rest of the structure. This rule is a clear 
violation of the semantic guidance that students were given. 
In our experiment, the students were instructed to use un- 
specified arcs for definitions or clarifications. Some students 
instead used them only when they were unsure about the 
strength of their evidence or did not understand the cita- 
tion. 
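
To illustrate what checking a diagram against the negative rule N-G could look like, the sketch below represents a diagram as a dictionary of typed nodes and a list of typed arcs. The representation, the function name, and the choice to ignore arc direction are all assumptions made for this example; they are not the augmented graph grammar formalism of [7, 11].

```python
def matches_negative_rule(nodes, arcs):
    """Return True if the diagram contains the N-G pattern.

    nodes: dict mapping node id -> node type
    arcs:  list of (source id, target id, arc type) tuples

    The pattern is a claim joined to a citation by an unspecified arc,
    together with a hypothesis node anywhere in the diagram.
    """
    has_hypothesis = any(node_type == "hypothesis" for node_type in nodes.values())
    if not has_hypothesis:
        return False
    for source, target, arc_type in arcs:
        if arc_type != "unspecified":
            continue
        # Direction is ignored: either endpoint may be the claim.
        if {nodes[source], nodes[target]} == {"claim", "citation"}:
            return True
    return False


# Two hypothetical student diagrams.
flawed = {
    "nodes": {"k": "claim", "c": "citation", "h": "hypothesis"},
    "arcs": [("c", "k", "unspecified")],
}
sound = {
    "nodes": {"k": "claim", "c": "citation", "h": "hypothesis"},
    "arcs": [("c", "k", "supporting"), ("k", "h", "supporting")],
}
print(matches_negative_rule(flawed["nodes"], flawed["arcs"]))
print(matches_negative_rule(sound["nodes"], sound["arcs"]))
```

A match on the first diagram could then be turned into feedback asking the student to state whether the citation supports or opposes the claim.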


4. PROPOSED SYSTEM 


In this work, we propose to build an Intelligent Argument 
Grading System (iARG) for student-produced argument di- 
agrams. Our goal is to automatically grade the structure 
of arguments for students and provide feedback that reflects 
the good features and structural flaws in students’ work. 
The proposed system includes two stages, which are shown 
in Figure 3. 


The top part of Figure 3 illustrates the first stage, Auto- 
matic Rule Induction, in which the system automatically 
induces empirically-valid graph rules for expert-graded ar- 
gument diagrams. The system will contain a database of 
argument diagrams and expert-assigned grades, along with 
a database of graph rules induced by the EC algorithm with 
a Chi-Squared filter as described in [11, 7]. After the system 
produces a set of individual rules, the induced rules are eval- 
uated by domain experts to determine whether or not they 
are semantically valid. Only valid rules will be incorporated 
into the database. Note that the induced rules contain both 
positive and negative examples. At the end of the process, 
we will use supervised learning methods to train an assess- 
ment model based upon the induced rules as features, along with 
other graph features (e.g. the degree of diagram nodes, the 
complexity of diagrams, and the attributes of the hub nodes 
in diagrams). 
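
The additional graph features mentioned above can be sketched as a small feature extractor. The feature names, the definition of the hub as the highest-degree node, and the sample diagram are hypothetical choices for illustration; the actual feature set is an open question of this work.

```python
from collections import Counter


def graph_features(nodes, arcs):
    """Compute simple structural features of an argument diagram.

    nodes: dict mapping node id -> node type
    arcs:  list of (source id, target id, arc type) tuples
    """
    degree = Counter({node_id: 0 for node_id in nodes})
    for source, target, _ in arcs:
        degree[source] += 1
        degree[target] += 1
    # The hub is taken to be the node with the highest degree
    # (ties broken by node id so the result is deterministic).
    hub = max(nodes, key=lambda node_id: (degree[node_id], node_id))
    return {
        "node_count": len(nodes),
        "arc_count": len(arcs),
        "max_degree": degree[hub],
        "mean_degree": sum(degree.values()) / len(nodes),
        "hub_type": nodes[hub],
    }


# Hypothetical diagram: two citations support a claim, which supports a hypothesis.
nodes = {"h": "hypothesis", "k0": "claim", "c0": "citation", "c1": "citation"}
arcs = [
    ("c0", "k0", "supporting"),
    ("c1", "k0", "supporting"),
    ("k0", "h", "supporting"),
]
features = graph_features(nodes, arcs)
print(features)
```

Such a dictionary could be concatenated with one binary indicator per induced graph rule to form the input of the assessment model.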


In the second stage, Automatic Grading and Feed- 
back, the trained model will automatically grade and pro- 
vide feedback on students’ submissions by identifying both 
good features and structural flaws of the arguments. After 
this, we will have experts periodically re-evaluate the automatic grades 
and feedback and, if necessary, re-grade 
the submission. We include this step because the students’ 
submissions may include novel structures that are not in- 
cluded in the current rule database. In this case, the as- 
sessment model may treat these novel structures as outliers 
and provide incorrect feedback. If the submissions are 
re-graded by experts, they will be added to the database 
of argument diagrams. The rule database and assessment 
model will also be updated for future use. 


Figure 3: Flowchart for the proposed iARG 


5. FUTURE WORK & OPEN QUESTIONS 


In future work, we plan to achieve the following: 


1. In Fall 2017, we plan to work with domain experts to 
determine whether the induced graph rules are seman- 
tically valid; whether they can be used for automatic 
grading; and whether they include all of the good fea- 
tures and structural flaws in students’ work. This gives 
rise to our first research question: how can we improve 
the performance of the graph rule induction algorithm 
by inducing more empirically-valid graph rules? 


2. In Spring 2018, we will leverage different supervised 
learning methods to train an assessment model from 
our current dataset of expert-graded argument dia- 
grams with the feature of valid graph rules and other 
graph features. We will evaluate the assessment model 
on a new set of student-produced argument diagrams. 
Our second research question is: what other graph 
features can we use to build the assessment model? 


3. In Fall 2018, we plan to implement the proposed sys- 
tem based upon LASAD by building databases for the 
argument diagrams and for the graph rules, and inte- 
grating the assessment model into the system. 


4. In 2019, we will test the performance of our system in 
an argumentative writing class at NCSU. We will focus 
on assessing the automatic grades and feedback from 
the students’ perspective and determine whether they 
find the automatic feedback to be useful. Thus, we will 
not have experts examine the automatic feedback in 
the second stage. Based upon the students’ feedback, 
we will consider whether to have experts regrade 
the new submissions and update the database and 
assessment model. 


REFERENCES 

[1] J. Burstein, C. Leacock, and R. Swartz. Automated 
evaluation of essays and short answers. 2001. 

[2] M. Harrell and D. Wetzel. Improving first-year writing 
using argument diagramming. In The 35th CogSci, 
pages 2488-2493, 2013. 

[3] M. A. Hearst. The debate on automated essay 
grading. IEEE Intelligent Systems and their 
Applications, 15(5):22-37, 2000. 

[4] F. Loll and N. Pinkwart. Lasad: Flexible 
representations for computer-based collaborative 
argumentation. International Journal of 
Human-Computer Studies, 71:91-109, January 2013. 

[5] C. F. Lynch and K. D. Ashley. Empirically valid rules 
for ill-defined domains. In J. Stamper and Z. Pardos, 
editors, Proceedings of The 7th International 
Conference on EDM. IEDMS, 2014. 

[6] C. F. Lynch, K. D. Ashley, and M. Chi. Can diagrams 
predict essay grades? In S. Trausan-Matu, K. E. 
Boyer, M. E. Crosby, and K. Panourgia, editors, ITS, 
Lecture Notes, pages 260-265. Springer, 2014. 

[7] C. F. Lynch, L. Xue, and M. Chi. Evolving augmented 
graph grammars for argument analysis. GECCO, 2016. 

[8] N. Pinkwart, K. D. Ashley, C. F. Lynch, and 
V. Aleven. Evaluating an intelligent tutoring system 
for making legal arguments with hypotheticals. 
IJAIED, 19(4):401-424, 2009. 

[9] L. M. Rudner and T. Liang. Automated essay scoring 
using bayes’ theorem. The Journal of Technology, 
Learning and Assessment, 1(2), 2002. 

[10] D. D. Suthers. Empirical studies of the value of 
conceptually explicit notations in collaborative 
learning. In A. Okada, S. Buckingham Shum, and 
T. Sherborne, editors, Knowledge Cartography, pages 
1-23. Springer Verlag, 2008. 

[11] L. Xue, C. Lynch, and M. Chi. Unnatural feature 
engineering: Evolving augmented graph grammars for 
argument diagrams. In International Educational Data 
Mining, pages 255-262. IEDMS, 2016. 

[12] L. Xue, C. F. Lynch, and M. Chi. Mining innovative 
augmented graph grammars for argument diagrams 
through novelty selection. EDM, 2017. 




Industry Track 


Dropout Prediction in Home Care Training 


Wenjun Zeng 
University of Minnesota 
Minneapolis, Minnesota 
zengx244@umn.edu 


Rui Kuang 
University of Minnesota 
Minneapolis, Minnesota 
kuang@cs.umn.edu 


ABSTRACT 

In Washington state (WA), SEIU 775 Benefits Group pro- 
vides basic home care training to new students who will 
deliver care and support to older adults and people with 
disabilities, helping them with self-care and everyday tasks. 
Should a student fail to complete their required training, it 
leads to a break in service, which can result in costly negative 
health outcomes (e.g. emergency rooms and hospitalization) 
for their clients [1]. 


In this paper we describe the results of utilizing machine 
learning predictive models to accurately identify students 
who exhibit a higher risk of dropout in two areas: (1) 
dropping out before attending first class [first class atten- 
dance]; and (2) dropping out before completing the train- 
ing [training completion]. Our experimental results show 
that the AdaBoost algorithm gives a useful result with ROC AUC 
= 0.627±0.013 and Precision at 10 = 0.73±0.12 for first class 
attendance and ROC AUC = 0.680±0.024 and Precision at 
10 = 0.67±0.20 for training completion without relying on 
additional assessment data about students. In addition, we 
demonstrate the use case for constructing larger decision 
trees to help front-line training operations staff identify in- 
tervention strategies that create the most impact in prevent- 
ing dropout. 


1. INTRODUCTION 


By 2050, the number of Americans needing long-term home 
care services and supports will double [2], implying increased 
demand for workers providing home care services (called 
“personal care aides” nationally and “home care aides (HCA)” 
in WA). This will also increase the demand for training for 
HCAs to provide quality care to their clients. In WA, should 
an individual wish to work as a home care aide, they are re- 
quired to complete a 75-hour, 2-week Basic Training (BT) 
course within 120 days of their hire date. In WA, an HCA 
can begin providing care before completing their training as 
long as their deadline has not passed. In the event that an 
HCA fails to complete BT, she or he will fall out of com- 
pliance, leading to the HCA’s termination and a break in 
service for the clients served by the HCA [1]. 


Educators have frequently used assessment tools that mea- 
sure cognitive skills, engagement, self-management and so- 
cial support to accurately predict student successes. How- 
ever, conducting assessments at scale is time consuming for 
both students and instructors. In the absence of a validated 
assessment specific to the HCA profession, there is great interest 
in utilizing existing learning data to isolate the strongest pre- 
dictors of dropout through the predictive power of machine 
learning algorithms. Our research questions are twofold: 
1) Can machine learning algorithms successfully predict stu- 
dent dropouts? 2) What are the risk factors related to early 
dropout from basic home care training? 


Many studies [3] have been conducted to explain academic 
performance and to predict success or failure across a va- 
riety of students in a wide range of educational settings. Ma- 
chine learning algorithms have been successful in predicting 
graduation [4], course participation [5], and other academic 
outcomes [6]. 


However, current research has not fully investigated the area 
of using machine learning algorithms for on-the-job training, 
healthcare training programs, or adult education in general. 
In this paper, we focus on the dropout problems in home care 
training using machine learning methods. We were granted 
the latitude to be creative with our feature engineering, uti- 
lizing readily available data to meet business requirements. 


2. EXPERIMENTAL SETUP 


Figure 1 illustrates the four sequential time-based milestones 
in home care training: 1) Complete Orientation & Safety 
(O&S); 2) Register for a 70-hour BT course; 3) Attend the 
first class in this course; 4) Complete the 70-hour training. 
At the moment that a prospective home care aide enters the 
system, a “Tracking Date” is assigned to their O&S training 
requirement, signifying the start of their training journey. 
On average a student will register for his or her first class 
approximately 19 days after completing O&S and will actu- 
ally attend his or her class about 64 days after entering our 
system. 


Figure 1: Predicting Targets and Features 


Predicting dropouts at different stages has the potential to 
allow for timely interventions that may improve a student’s 
learning experience. This paper focuses on two stages: First, 
Class Attendance: Will the newly hired students show up 
for their first scheduled class? We attempt to predict this 
at the point of registration. Second, Training Comple- 
tion: Will a student complete all 70 hours of their required 
training? We attempt to predict this at the point that a 
student attends his or her first class. As shown in Figure 1, 
some basic but sometimes incomplete student demographic 
data are captured at the time a student is assigned to take 
O&S training. As a student progresses in his or her training 
journey, we are able to extract more features about learning 
behavior, such as the amount of time a student needed to 
complete O&S or the number of days it took a student to 
register for class. In addition, we leveraged external gov- 
ernment census data to augment the existing feature set by 
adding income and population data of the student’s county 
of residence. 


We built four models (Logistic Regression, SVM, Random
Forests, and AdaBoost) for the two prediction targets
described above. Our final data set contained 5,303 records
for predicting first class attendance and 5,182 records for
predicting training completion. For both prediction targets,
we reserved 2,000 records as the test set and used the
remaining records as the training set. We collected
22 features to predict training completion and used the first
19 features to predict first class attendance (the last three
features are not available at our prediction point of
registration). Table 1 summarizes the features we used for the
models.


3. EXPERIMENT RESULTS 


3.1 Prediction Performance: ROC-AUC and Precision at k

We use the area under the receiver operating characteristic
curve (ROC AUC) and precision at k (Prec@k) to evaluate
the prediction quality of each machine learning technique.
ROC AUC was used as a standard evaluation metric to measure
the quality of the overall ranking. Prec@k was used
to determine the quality of predicting the top k outcomes, in
our case the k students with the highest dropout risk at each
stage. This assumes that, with limited resources, front-line
staff can reach out to only k students per week to
provide support and assistance to home care aides (HCAs)
struggling to meet their individual learning needs. Therefore,
it is essential to accurately predict the first k students
exhibiting the highest dropout risk.
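

As a minimal sketch of the Prec@k metric described above, the snippet below ranks students by a predicted risk score and reports the fraction of true dropouts among the top k. The function name, the scores and the labels are illustrative assumptions and are not taken from the study's data or code.

```python
def precision_at_k(scores, labels, k):
    """Fraction of true dropouts among the k students with the highest risk scores."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of students")
    # Rank students by predicted dropout risk, highest first.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    return sum(label for _, label in top_k) / k


# Made-up example: risk scores and true outcomes (1 = dropped out, 0 = did not).
example_scores = [0.91, 0.15, 0.78, 0.40, 0.66, 0.05]
example_labels = [1, 0, 0, 0, 1, 0]
print(precision_at_k(example_scores, example_labels, 3))
```

With these made-up values, two of the three highest-ranked students are true dropouts, so the call with k = 3 evaluates to 2/3.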


Figures 2a and 2b depict the prediction results of our four
models in terms of precision at k. The AdaBoost model gives
the best prediction result for both targets. For predicting
first class attendance, AdaBoost with tree number = 2000
has the highest precision at 10 (0.73), and AdaBoost with
tree number = 1000 gives the best precision at 20, 50 and 100
(0.67, 0.56 and 0.46, respectively). For predicting BT
completion, AdaBoost with tree number = 100 gives the best
precision at 10, 20, 50 and 100 (0.67, 0.62, 0.53 and 0.44,
respectively). As there are more students who did not attend
the first class (385/2000 = 19.25%) than students who did not
complete the training (229/2000 = 11.45%), it was slightly
easier to predict the top k students who were likely not to
show up for their first class, which explains the higher
Prec@k for predicting class attendance.


Table 1: Features used for class attendance and training completion prediction

Feature | Type | Remarks
provider_type | Nominal | Individual provider (paid by the Department of Social and Health Services) or agency provider (paid by private home care agencies). {IP, AP}
student_ethnicity | Nominal | Student ethnicity. {Asian Indian, White, etc.}
student_language | Nominal | Student language. {English, Russian, etc.}
student_age | Numerical | Student age. {mean = 39, median = 37}
os_month | Numerical | Month of O&S tracking date. {1, 2, ..., 12}
os_day | Numerical | Day of O&S tracking date. {1, 2, ..., 31}
class_language_containEnglish | Boolean | Whether the student's profile includes an English language selection. {Yes, No}
class_language_containOther | Boolean | Whether the student's profile includes a language other than English. {Yes, No}
county | Nominal | Student's county of residence. {King County, Pierce County, etc.}
county_income_mean | Numerical | The mean income (in USD) for the county. {mean = 67011, median = 65498}
county_income_median | Numerical | The median income (in USD) for the county. {mean = 55468, median = 54727}
county_population | Numerical | The population of the county. {mean = 28672, median = 29582}
os_transferredhours | Numerical | Transferred hours for O&S. {mean = 0.9965, median = 0}
duration_to_oscomplete | Numerical | Duration (in days) between O&S completion date and O&S tracking date. {mean = 0.842, median = 1.500}
first_module | Nominal | The module of the first registered class. {Module 1, Module 2, ..., Module 20, etc.}
duration_to_class | Numerical | Duration (in days) between class date and O&S tracking date. {mean = 72.05, median = 67.42}
first_class_interpreter | Boolean | Whether the student articulated a need for interpreter services. {Yes, No}
duration_to_class_registration | Numerical | Duration (in days) between class registration date and O&S tracking date. {mean = 32.647, median = 19.784}
num_terminations | Numerical | Number of terminated employment relationships before attending the first class. {0, ..., 7}
student_noshow_count | Numerical | Number of class absences before attending the first class. {0, ..., 58}
student_withdraw_count | Numerical | Number of class withdrawals before attending the first class. {0, ..., 60}
num_class_attendee | Numerical | Number of attendees in the first class. {3, ..., 33}


[Figure 2: Precision at k results. (a) Precision at k for first class attendance; (b) Precision at k for training completion. Axes: k versus precision at k. Algorithms: AD, LR, RF, SVM.]


Model | 1st Class Attendance | Training Completion
SVM (radial) | 0.578±0.012 | 0.600±0.011
LR | 0.612±0.020 | 0.634±0.018
AD(1000) | 0.627±0.013 | 0.673±0.025
AD(2000) | 0.626±0.015 | 0.680±0.024
RF(2000) | 0.608±0.012 | 0.672±0.023

Table 2: ROC AUC results


Table 2 shows the ROC AUC results. For predicting first class
attendance, AdaBoost with tree number = 1000 gives the
best ROC AUC at 0.627. For predicting BT completion,
AdaBoost with tree number = 2000 gives the best ROC AUC at
0.68. These low ROC AUC values indicate the need for stronger
inputs and feature attributes for the models. Although 19 out
of 22 attributes were shared by both prediction problems,
attributes such as duration to class registration, duration to
class and first module were more useful in predicting BT
completion than in predicting class attendance. This
explains the higher ROC AUC results for BT completion
predictions. The weaker class attendance results point to an
opportunity to understand why students choose not to attend
their registered training classes and to collect more data at
this early stage of the training journey.
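

For readers unfamiliar with the metric, ROC AUC can be read as the probability that a randomly chosen dropout receives a higher risk score than a randomly chosen non-dropout. The sketch below computes it by direct pairwise comparison; the function name and the toy values are assumptions made for illustration and do not reproduce the evaluation code used in the study.

```python
def roc_auc(scores, labels):
    """Pairwise ROC AUC: P(score of a positive > score of a negative), ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: risk scores and true outcomes (1 = dropped out, 0 = did not).
auc_scores = [0.91, 0.15, 0.78, 0.40, 0.66, 0.05]
auc_labels = [1, 0, 0, 0, 1, 0]
print(roc_auc(auc_scores, auc_labels))
```

A value of 0.5 corresponds to random ranking, which is why values around 0.6 to 0.7 leave considerable room for improvement.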


3.2 Risk Profile Analysis 


In this section, we illustrate how we use insights derived
from decision tree modeling to profile students with different
dropout rates, providing a tool to isolate target segments
of high-risk students so the business can take measures that
can decrease the dropout rate. Decision tree modeling enables
us to acquire the foundational knowledge necessary to develop
educated hypotheses for customized interventions to support
students with different risk profiles. Variable importance
analysis using Random Forest also enhances our understanding
of what factors influence training dropout and assists in
our predictions.


At the root node of Figure 3a, the average first class
attendance rate is almost 81% among 5,303 students. That is,
the overall dropout rate is 19%. Students who did not
enroll in either module 1 or 2 as their first class¹
demonstrated a significantly higher risk of not attending the
training: 54% will not show up for their first registered
class. Using the same decision tree, we are also able to infer
that both county and age are important factors. For example,
students who do not reside in certain counties²
and are younger than 49 are less likely to attend the first
class compared to those who are older than 49. Younger
students, English-speaking students and students who take
longer to complete O&S exhibit a higher risk of not attending
their first class. The variable importance from random
forest shows that duration to class registration and duration
to class are the other most important indicators. The larger
the time gaps, the higher the dropout rates.


¹ Currently, students are allowed to attend classes out of
sequence in order to complete their training before the
mandatory deadline.

² Counties include: Benton, Clark, Cowlitz, Douglas, Grays
Harbor, Lewis, Mason, Skagit, Stevens, Walla Walla and
Whatcom.


Figure 3b gives a decision tree for training completion. From
the display, we can see that if students have two or more class
absence records before actually attending the first class, their
completion rate decreases to 60%, which is much lower than
the average completion rate of 89%. Among these students,
if their first class is not Module 1, the likelihood that
the student will complete training drops to 27%. The tree
shows that duration to class registration and class location
(i.e., county) play an important role in training completion.
Duration to class and student age are also shown to be
important indicators by the random forest variable importance
analysis. In addition, knowing the count of class absence
records and the first class module gives a much better
understanding of BT completion. Figure 3b shows that even for
students who had one or zero class absences, if they register
for the class too late (in our case, more than 52 days after
being hired), the probability of completing the training
is lower.


4. RELATED WORK 


Prior studies ([3], [7], [8]) have been conducted to explain
academic performance and to predict success or failure across
a variety of students in a wide range of educational settings.
These studies focused heavily on the explanatory factors
associated with a student's learning behavior and training
journey and on which of those may cause separation between
student types. Machine learning algorithms have been
successful in high school and college education settings, and
most helpful in predicting graduation [4], course
participation [5], and other academic outcomes [6]. These
algorithms also provide great value to student success [9].


Lakkaraju et al. [6] used several classification models to
identify students at risk of adverse academic outcomes and used
precision_at_top_K and recall_at_top_K to predict risk early.
The authors compared ROC curves for two cohorts for the
algorithms Random Forest, AdaBoost, Linear Regression, SVM
and Decision Tree, and demonstrated that Random Forests
outperformed all other methods. Aguiar et al. [10] selected
and prioritized students who are at risk of not graduating
high school on time by predicting the risk for each grade
level, and reported precision at top 10%, accuracy, and MAE
for ordinal prediction of time to off-track.
Johnson et al. [11] used a d-year-ahead predictive model to
predict on-time graduation for different grade levels.
Vihavainen et al. [5] found that a higher likelihood of
failing a mathematics course could be detected at an early
stage using a Bayesian network. Radcliffe et al. [4] used a
logit probability model and parametric survival models to find
that demographic information, academic preparation and
first-term academic performance have a strong impact on
graduation. Dekker et al. [12] gave experimental results
showing that decision trees achieved high accuracy for
predicting student success and that cost-sensitive learning
improved prediction accuracy.


Other prior studies have highlighted some important
indicators that influence students' performance, such as a
student's age and absence rates [6]. Based on these features,
Early Warning Indicator (EWI) systems are rapidly being built
and deployed using machine learning algorithms [6]. Similar
to other research in Educational Data Mining (EDM),
we use precision at k to measure the prediction result ([6],
[10], [13]) and, as in traditional education systems, our
motive is to target our limited resources as effectively and
efficiently as possible to assist and support students.
Typically, ensemble models outperform individual models [7],
and this held true in our case as well. While random forest
has proven to be an extremely useful and powerful machine
learning technique in educational research [11], our results
indicated that AdaBoost outperformed random forest.


5. CONCLUSION AND FUTURE WORK 


In this study, we demonstrated preliminary results for
predicting home care student training dropout from a large,
heterogeneous dataset containing student demographics and
engineered features extracted from training patterns.
Predicting dropout at varying stages of an adult learner's
training journey yielded promising results from a skewed
dataset of 5,303 students, with AdaBoost (2,000 trees) providing
the strongest predictions (Prec@10 = 0.73 and ROC AUC =
0.625). Prior history of class absence and time effects
(duration to registration, duration to first class) were among
the strongest individual predictors of dropout, as were class
module sequence, county, and student age. The results
demonstrate that applying machine learning techniques to
demographic data and learning behavior data (e.g., duration
to registration, duration to first class) can achieve adequate
prediction quality in predicting the top k highest-risk
students out of a pool of newly hired HCAs. This enables
efficient use of limited capacity and resources to support
the students in greatest need. Insights revealed in this study
inspired training operation staff to explore alternatives,
including encouraging newly hired HCAs to register for
training early and strongly recommending the proper class
sequence to support students' success in their training.


Future work will investigate collecting more information about
students, such as their motivations, propensity for
self-efficacy, and life circumstances, to determine whether
other factors at play on a personal level may uncover
additional features that can contribute to our target
predictions around training dropout.
[Figure 3: Decision Trees. Panels: (a) first class attendance, (b) training completion.]


ABSTRACT


Knowledge Tracing plays a key role in personalizing learning in
an Intelligent Tutoring System such as funtoot. Bayesian
Knowledge Tracing (BKT) is the simplest well-studied model and
is known to work well. Recently, Deep Knowledge Tracing (DKT),
based on deep neural networks, was proposed with great promise.
Soon after, however, it was found that the gains achieved by
DKT were not significant compared to Performance Factor
Analysis (PFA) [13] and to BKT and its variants proposed in [6].
To examine and study these models, we experiment with them on
our dataset. We also introduce a logical extension of DKT,
Multi-Skill DKT, to incorporate items requiring knowledge of
multiple skills. We show that PFA clearly outperforms all the
other models when the AUC results are averaged over skills,
while PFA and DKT perform equally well when the results are
averaged over all data points.


Keywords 

Deep Knowledge Tracing, Adaptive Learning, funtoot, 
Bayesian Knowledge Tracing, Intelligent Tutoring System, 
Performance Factor Analysis 


1. INTRODUCTION 


The main role of an Intelligent Tutoring System is to deliver
instruction and provide feedback as and when required.
To do that, the system needs to measure the knowledge
state of a student with respect to the available content. The
system continuously monitors the student's performance,
updates the knowledge state and takes further decisions based
on it. The techniques capable of performing these
functions are called Knowledge Tracing models.


Bayesian Knowledge Tracing (BKT) [2] has been one of the most
heavily researched models in the educational data
mining domain. BKT is a 2-state, skill-specific model in which
the student's knowledge state can take one of two
values: learned or unlearned. Moreover, a skill once learned
cannot be unlearned. These assumptions make it a very
simple and constrained model and have led many researchers
to extend it with new features to improve its performance,
making it less constrained. For instance, [10] extends BKT to
the scenario where students do not necessarily use the system
on the same day.
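

To make the 2-state assumption concrete, here is a minimal sketch of one update step in the standard textbook formulation of BKT. The parameter names (`p_learn`, `p_slip`, `p_guess`) and the numeric values are illustrative assumptions; the paper itself does not list its parameterization in this section.

```python
def bkt_predict_correct(p_known, p_slip, p_guess):
    """Probability of a correct response given the current mastery estimate."""
    return p_known * (1 - p_slip) + (1 - p_known) * p_guess


def bkt_update(p_known, correct, p_learn, p_slip, p_guess):
    """One step of standard BKT: condition on the response, then apply learning."""
    if correct:
        evidence = p_known * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_known) * p_guess)
    else:
        evidence = p_known * p_slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - p_guess))
    # No forgetting term: once learned, a skill is assumed to stay learned.
    return posterior + (1 - posterior) * p_learn


# Illustrative values only: start at 0.5 mastery and observe one correct answer.
p_after_correct = bkt_update(0.5, True, p_learn=0.1, p_slip=0.1, p_guess=0.2)
p_after_wrong = bkt_update(0.5, False, p_learn=0.1, p_slip=0.1, p_guess=0.2)
print(p_after_correct, p_after_wrong)
```

The absence of any term that lowers mastery between observations is exactly the "cannot be unlearned" constraint that the forgetting variants discussed later relax.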


The authors of [14] proposed an individualized BKT model
which fits not only skill-specific parameters but also
student-specific parameters, and reported significant gains
over standard BKT.


Educational data mining techniques can now very accurately
predict how well a student has learned a Knowledge
Component (KC). However, they do not give information about the
exact moment when the KC was learnt. [3] discusses a
technique for finding the moment of learning.


Another model, Performance Factor Analysis (PFA), is a
logistic regression model proposed in [7] which showed better
performance than standard BKT. Unlike BKT, PFA can
incorporate items with multiple skills. PFA makes predictions
based on the item difficulty and the historical performance of
a student. [4] compared BKT and PFA using various
model-fitting procedures such as Expectation Maximization
(EM) and Brute Force (BF). Knowledge tracing models
fitted with EM have shown performance comparable to PFA [4].


The most recently published model, DKT [9], is the newest
technique in this area of research. DKT is an LSTM [5]
network, a variant of the recurrent neural network [11], which
takes as input a series of exercises attempted by the student
together with a binary digit for each indicating whether the
exercise was answered correctly. DKT has shown significant
gains over BKT, which makes it tempting for researchers in
this community to study it further. Papers [6], [13] and [12]
did just that.


The authors of [13] pointed out a few irregularities in the
dataset used by the authors of [9] which, when accounted for,
reduce the gain reported for DKT. They also reported
that DKT does not hold an edge when the results are
compared with PFA.


Another standard framework for modelling student responses,
a temporal extension of Item Response Theory (IRT),
is compared with DKT in [12]. The authors reported that
the variants of IRT consistently matched or outperformed
DKT.


The recent paper [6] studies DKT further and explains why
DKT might be better. It points out that DKT inherently
exploits characteristics of the data which standard models
like BKT cannot. To make a fair comparison between the two,
the authors presented three different variants of BKT, with
forgetting, skill discovery and latent abilities, which might
help BKT make use of information in the data the way DKT does.


Having introduced these variants, the authors also make the
point that Knowledge Tracing might not require the "depth"
that deep learning models offer.


As funtoot is an Intelligent Tutoring System, its tutor module
requires a sophisticated knowledge tracing technique which
models the process of knowledge acquisition and helps
students achieve mastery. One such model operates at the level
of learning gaps (LGs, discussed in section 2) and models how
they are committed and avoided over time and with practice. In
the context of this paper, these LG models are of prime
importance to us, and henceforth we will refer to LGs as
skills. Also, considering user experience, we need a model
which can be used for predictions in real time without
compromising on user latency.


In this paper, we test standard BKT, the variants of BKT,
DKT and PFA on the funtoot dataset and examine the
results. We also introduce a logical and trivial extension
of DKT to accommodate items which involve multiple
skills. Of all the models considered in this article, PFA
is one model which allows items with multiple skills.
In our dataset, however, each of the skills in an item has its
own response and is hence modelled separately in PFA.


The rest of the paper is organized as follows: section 2 gives
a brief introduction to our product funtoot and its knowledge
graph. Section 3 discusses the experiments on the funtoot
dataset and their results. Section 4 discusses future work and
the conclusion.


2. FUNTOOT 


Funtoot¹ is a personalized digital tutor which is currently
being used actively in around 125 schools all over India, with
a total of 99,842 students registered. Funtoot covers the
math and science curriculum for grades 2 to 9.


Schools in India are typically affiliated with one of the boards
of education². Curricula for math and science from the
following boards of education are included in funtoot:


• CBSE³ board for grades 2 to 9,
• Karnataka State Board⁴ for grades 2 to 8,
• ICSE⁵ board for grades 2 to 8 and
• IGCSE⁶ board for grades 2 to 3.


¹ http://www.funtoot.com/
² https://en.wikipedia.org/wiki/Boards_of_Education_in_India
³ https://en.wikipedia.org/wiki/Central_Board_of_Secondary_Education
⁴ https://en.wikipedia.org/wiki/Karnataka_Secondary_Education_Examination_Board
⁵ https://en.wikipedia.org/wiki/Indian_Certificate_of_Secondary_Education
⁶ https://en.wikipedia.org/wiki/International_General_Certificate_of_Secondary_Education
2.1 Funtoot Knowledge Graph


The pedagogy team at funtoot has created a funtoot ontology
around the subjects Math and Science. This ontology
represents the various learning units of a subject and their
relationships, and is created based on human expertise in
the subject matter. All the above-mentioned curricula are
derived from this funtoot ontology based on the age
group and grade.


An ontology for a subject is created as follows:


1. a subject is broken down into the smallest teachable
sub-sub-concepts


2. it is then mapped to determine
inter-dependencies/connections between concepts,
sub-concepts (sc) and sub-sub-concepts (ssc) as shown in
figure 1.
Consider the example shown in figure 1. The subject
Math contains a concept Triangle, and Triangle
contains a sub-concept Congruency. This sub-concept
contains two sub-sub-concepts: Rules of Congruency
and Applications of Congruency. Sub-sub-concepts
are connected by a "depends-on" relationship. Here,
Applications of Congruency depends on Rules
of Congruency, which means that the latter is a
prerequisite for the former.


3. learning gaps (definition 1) are determined in the
sub-sub-concepts


DEFINITION 1. Learning Gap (LG): "A learning gap
is a relative performance of a student in a specific skill,
i.e. difference of what a student was supposed to learn,
and what he actually learned in a skill."⁷

"A misunderstanding of a concept or a lack of knowledge
about a concept that is required for a student to
solve or answer a particular question is also a learning
gap."


For instance, a question "Solve 12 + 18" is given to
student Alice. If Alice makes a mistake while adding the
carry and answers 20, we say that an LG (carry-over
error) has been committed. Had she answered 30, this
LG would be said to have been avoided. This question
might also have other LGs which could have been
committed simultaneously with the LG mentioned above.
If the response is correct, all the LGs of a question are
said to have been avoided.


In figure 1, Applications of Congruency is an ssc
containing LG1, LG2 and LG3. Learning gaps can have
"induce" relationships. In our example, LG1 induces
LG32. 


4. inter-dependencies get refined based on the data points
received by funtoot through the user's interaction


5. an ssc is further divided into six Bloom's Taxonomy
Learning Objectives (btlos) using Bloom's Taxonomy [1].
Each learning objective has five difficulty
levels as shown in table 1. Each cell (for instance,
Remember1, Apply2 and so on) in table 1 is called a
complexity in funtoot.


⁷ http://edglossary.org/learning-gap/


[Figure 1: Funtoot Knowledge Graph. Legend: subject, concept, sub-concept, sub-sub-concept, learning gaps; edge label: induces.]


2.2 Dataset 


During a student's interaction with funtoot, the following
information is logged: session, the scope of the question
(which includes grade - subject - topic - subtopic -
subsubtopic - complexity - question), question identifier,
start time, total attempts allowed based on the student's
performance, time taken, attempts taken, information about
hints, LGs committed in each attempt, assistance provided and
so on.


In the study presented in this paper, we model an LG as a skill.
We aim to predict a student's proficiency in a particular LG.
When a student is presented with an item, several attempts
are provided to solve it. In an unsuccessful attempt a
student might commit more than one LG, as explained in
section 2, and the same LG can also be committed in several
attempts. We know a priori the set of LGs that are exposed
by a question. With this information at hand, we need an
impression of each of these LGs for the student in the
context of this item.


Consider a hypothetical example. Alice attempts an item q
from the subtopic Rules of Congruency having skills s1, s2, s3.
The series of attempts is shown in table 2.


Table 2: Attempts made by Alice while solving q


In the above table, 1 represents avoidance and 0 represents
committance. As shown in the table, Alice committed s1 in
attempts 1, 2 and 3. Alice committed s2 in attempt 3. Alice
avoided s3 in all attempts. The overall outcome of Alice in
LGs s1, s2 and s3 is (0,0,1), which is a logical AND over
all attempts. This means that s1 and s2 are committed and
s3 is avoided. From now on, we will refer to these outcomes
as committances and avoidances, and they will be used for
modelling. So this problem attempt of Alice gives rise to
three data points.
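

The logical AND over attempts can be sketched as below. Since the contents of table 2 are not reproduced here, the snippet assumes exactly three attempts that match the description in the text (s1 committed in attempts 1 to 3, s2 committed in attempt 3, s3 always avoided); the function name is hypothetical.

```python
def overall_outcome(attempts):
    """Per-skill logical AND over attempts (1 = avoided, 0 = committed)."""
    return tuple(int(all(column)) for column in zip(*attempts))


# Assumed reconstruction of Alice's attempts on item q, columns are (s1, s2, s3).
alice_attempts = [
    (0, 1, 1),  # attempt 1: s1 committed
    (0, 1, 1),  # attempt 2: s1 committed
    (0, 0, 1),  # attempt 3: s1 and s2 committed
]
print(overall_outcome(alice_attempts))
```

Under this assumption the result is (0, 0, 1), matching the outcome stated in the text: a skill counts as avoided only if it was avoided in every attempt.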


For this experiment we have used data of 6th grade CBSE
math from 2015-07-25 to 2017-01-30. The syllabus
hierarchy for this dataset is as follows: 22
topics, 69 subtopics, 119 sub-sub-topics, 541 complexities
and 1,524 problems. This dataset has 26,06,022 entries of
problem attempts involving 442 skills. The data covers
176 schools with 11,820 students and 1,524 problems. From
this dataset, the data of students having fewer than 100
problem attempts were excluded. This gives us 24,47,027
problem attempts involving 442 skills with 7,780 students and
1,523 problems. Finally, we have 56,04,227 data points,
of which 42,68,503 are avoidances (class 1) and 13,35,724 are
committances (class 0).


In the context of the example shown in table 2, the length
of Alice's attempt to solve question q is three,
as there are three skills involved. Given this definition of
the length of a problem attempt, figure 2 shows the
distribution of the length of the problem attempts in the
dataset. 38.18% of the total problem attempts have 1 skill,
i.e., length 1, and 29.47% of the problem attempts have
length 2.


3. EXPERIMENTS 


In this section, we discuss the experiments done on our
dataset and report the results. Consider a hypothetical
dataset of student Alice attempting questions q1 and q2 in
that order. Question q1 has three skills A, B and C;
question q2 has two skills B and C. Alice gets only one
attempt for each question, in which she commits skills
B and C in question q1 and skill B in question q2.
This example is used in this section to explain the training
datasets for each of the techniques.


Figure 2: Data Distribution


3.1 Bayesian Knowledge Tracing 

After DKT [9], the authors of [6] explored and hypothesized
the properties of the data which DKT exploits while
standard BKT cannot. To equip BKT with those capabilities,
the authors proposed three variants of BKT:
BKT with forgetting (BKT+F), BKT with skill discovery
(BKT+S) and BKT with latent abilities (BKT+A).


We have used the authors' implementation of BKT
and its three variants, published at
https://github.com/robert-lindsey/WCRP/tree/forgetting, to
train on our dataset. The data format required by these BKT
variants is shown in table 3. As discussed in section 1,
BKT is a skill-specific model, and thus three models need to
be built, one each for skills A, B and C. Each model needs
the time series of responses as shown in table 3.


Table 3: BKT data format


All variants of BKT except the ones where skill discovery
is involved, namely BKT, BKT+F, BKT+A and BKT+FA,
operate on the skills provided by the data. The remaining
variants, BKT+S and BKT+FSA, completely ignore the
expert-tagged skills available in the data. This is achieved
by setting the non-parametric prior β on the expert-tagged
skills to 0.


3.2 Performance Factor Analysis

Like BKT, PFA is a skill-specific model and requires a
separate model to be built for each skill. The Logistic
Regression model of [8] is used in the implementation of PFA.
For each skill, the response is a function of the skill
difficulty, the number of prior successes (avoidances) and the
number of prior failures (committances) of the student on that
skill. From the implementation point of view, the decision
function has two variables: the number of prior success
instances and the number of prior failure instances for the
skill. A bias is also added to the decision function (via
the intercept), which serves as the skill difficulty. The data
format needed by PFA is shown in table 4.


ID | no. of failures | no. of successes | response


Table 4: PFA data format
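

As a rough sketch of how rows in this format could be derived, the snippet below walks through Alice's hypothetical responses in time order and records, for each skill, the number of prior failures and prior successes before each response. The function name and the flattening of q1 and q2 into (skill, response) pairs are assumptions made for illustration; each skill's rows would then be fed to its own logistic regression.

```python
from collections import defaultdict


def pfa_rows(interactions):
    """Map a time-ordered list of (skill, response) pairs to per-skill PFA rows.

    Each row is (prior failures, prior successes, response), with
    response 1 = avoidance (success) and 0 = committance (failure).
    """
    failures = defaultdict(int)
    successes = defaultdict(int)
    rows = defaultdict(list)
    for skill, response in interactions:
        rows[skill].append((failures[skill], successes[skill], response))
        if response == 1:
            successes[skill] += 1
        else:
            failures[skill] += 1
    return dict(rows)


# Alice: q1 (A avoided, B and C committed), then q2 (B committed, C avoided).
alice_interactions = [("A", 1), ("B", 0), ("C", 0), ("B", 0), ("C", 1)]
for skill, skill_rows in pfa_rows(alice_interactions).items():
    print(skill, skill_rows)
```

For skill B, for example, the second row starts with one prior failure and zero prior successes, because Alice committed B in q1.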


3.3 Deep Knowledge Tracing

The implementation of LSTM-based DKT published at
https://github.com/mmkhajah/dkt is used to train on our
dataset. The neural network of DKT requires as input a
one-hot encoding of the skills together with the response for
each of them, while the output is the probability of
correctness for each of the skills. Hence the size of the
input is twice the number of skills and that of the output is
the number of skills. The serial number in table 5 shows the
order in which the inputs are fed into the network. The input
in the table signifies the previous output, while the response
shows the expected output of the network. The odd bits in the
input represent the one-hot encoding of the skills, while the
even bits represent their responses. X in the output shows
that the bit can take either 0 or 1.


serial no. input response 


Table 5: DKT data format 
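
The encoding described above can be sketched as follows. This is only an illustration under the stated bit layout (odd bits, counted from 1, mark the skill; even bits carry its response); the helper names are hypothetical and are not part of the referenced DKT code.

```python
def encode_dkt_input(skill_index, correct, num_skills):
    """Build the 2 * num_skills input vector for one (skill, response) data point.

    With 0-based indexing, position 2k marks skill k and position 2k + 1
    carries the response for skill k.
    """
    vec = [0] * (2 * num_skills)
    vec[2 * skill_index] = 1
    vec[2 * skill_index + 1] = int(correct)
    return vec


def encode_dkt_target(skill_index, correct, num_skills):
    """Build the expected output; None stands for X (the bit may be 0 or 1)."""
    target = [None] * num_skills
    target[skill_index] = int(correct)
    return target


# Skills A, B, C map to indices 0, 1, 2 in this example.
x = encode_dkt_input(1, True, 3)
y = encode_dkt_target(1, True, 3)
```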


As discussed in subsection 2.2, there is no clear or fixed
ordering from which to determine the final outcomes for the
LGs in an item attempt. However, the time series fed into the
DKT network requires us to establish an ordering between
them. We sample the orderings randomly and average the
results over them. The sample dataset in Table 5 is one
such ordering. Another random ordering can be seen in
Table 6. The skills of the item q1 are in the order A, B, C
in Table 5 while their order is B, A, C in Table 6. The other
way to handle the ordering is to get rid of it altogether
by merging the data points of the skills in an item, which is
explained in the following subsection.


serial no. input response 


Table 6: Shuffled skills DKT data format 


3.4 Multi-skill DKT 


As explained in the context of DKT, the orderings among the
skills in the item are sampled randomly. In order to get rid
of such orderings, we introduce an extension of DKT,
Multi-skill DKT, which can incorporate items having multiple
skills efficiently. It can be seen from Table 7 that the
three data points of q1 and the two data points of q2 are
consolidated, and we are left with two data points in total. The
size and structure of the inputs and outputs remain the
same. The only difference is that the input and output can
carry information about multiple skills simultaneously.

Table 7: Multi-Skill DKT data
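
A minimal sketch of this consolidation, assuming the same bit layout as for DKT (position 2k marks skill k, position 2k + 1 carries its response), is given below. The function name and the dictionary-based input are assumptions made for illustration.

```python
def encode_multiskill_input(skill_responses, num_skills):
    """Merge all (skill, response) pairs of one item attempt into one vector.

    skill_responses maps a 0-based skill index to its response (0 or 1).
    The vector has the same size as a single-skill DKT input, but several
    skill bits may be set at once.
    """
    vec = [0] * (2 * num_skills)
    for skill_index, correct in skill_responses.items():
        vec[2 * skill_index] = 1
        vec[2 * skill_index + 1] = int(correct)
    return vec


# An item tagged with skills A, B, C (indices 0, 1, 2) becomes one data point.
merged = encode_multiskill_input({0: 1, 1: 0, 2: 1}, 3)
```

Because the whole item attempt is one time step, no ordering among its skills has to be chosen.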


3.5 Results 


For all the algorithms, we use three replications of 2-fold
cross validation, which gives us 6 folds in total over which
the results are averaged. We use the area under the Receiver
Operating Characteristic (ROC) curve, which we
refer to as the AUC. Paper [6] discusses the inconsistent
procedures used to compute and compare the performance of BKT
and DKT. We therefore compute AUC both by averaging
over all data points and by averaging over skills. The results of
our experiments on the funtoot dataset are shown in Figure 3.
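
The two averaging procedures can give different numbers on the same predictions. The sketch below illustrates this with a pairwise AUC computed in plain Python; the record layout and the choice to skip skills that have only one response class are assumptions of this example, not details taken from our experiments.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Returns None when one of the classes is absent.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_over_all_points(records):
    """Pool every (skill, label, score) record and compute a single AUC."""
    return auc([y for _, y, _ in records], [s for _, _, s in records])


def auc_averaged_over_skills(records):
    """Compute one AUC per skill and take the unweighted mean."""
    by_skill = defaultdict(list)
    for skill, y, s in records:
        by_skill[skill].append((y, s))
    values = []
    for pairs in by_skill.values():
        value = auc([y for y, _ in pairs], [s for _, s in pairs])
        if value is not None:  # assumption: skip skills with a single class
            values.append(value)
    return sum(values) / len(values)


# Hypothetical predictions: each skill is ranked perfectly on its own,
# but the score scales of the two skills differ.
records = [("A", 1, 0.9), ("A", 0, 0.8), ("B", 1, 0.3), ("B", 0, 0.2)]
pooled = auc_over_all_points(records)
per_skill = auc_averaged_over_skills(records)
```

In this toy case the pooled AUC is lower than the per-skill average because the positive example of skill B scores below the negative example of skill A.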


When AUC is averaged over all the data points, the relative
difference in performance between algorithms is very low,
with 0.83 being the lowest and 0.88 the highest. PFA and
DKT share the highest performance of 0.88 AUC. Multi-skill
DKT lags slightly behind DKT by 0.03 AUC units (0.85 AUC).
All the variants of BKT also lag behind DKT and PFA, though
not by a large margin, the largest gap being 0.05 AUC units.
Among the BKT variants, BKT has the lowest AUC of 0.83,
BKT+FSA has the highest AUC of 0.85 and the rest have an
AUC of 0.84, which indicates that they all perform equivalently.


The relative difference in performance between algorithms is
higher when AUC is averaged over skills, the lowest being 0.64
AUC for BKT+F and the highest being 0.88 AUC for PFA, which
is a 37.5% gain. PFA, with an AUC of 0.88, outperforms all
the other methods with a minimum gain of 17% (over the 0.75
AUC of DKT and BKT+FSA) and a maximum gain of 37.5% (over
the 0.64 AUC of BKT+F). Here too, the difference
between DKT and Multi-skill DKT is very small, 0.04 AUC
units to be precise, with Multi-skill DKT lagging behind.


With BKT, BKT+F, BKT+A and BKT+FA having AUCs
of 0.65, 0.64, 0.68 and 0.67 respectively, it is clear that
Forgetting adds no value. The number of skills discovered by
both BKT+S and BKT+FSA is in the range of 145 to 175,
compared to 442 original skills. The Skill Discovery
extension provides reasonable gains, which are evident from
the AUCs of BKT and BKT+S (9% gain) and of BKT+FA
and BKT+FSA (12% gain). The gains achieved by the
Abilities extension are very small: 0.03 AUC units
both from BKT to BKT+A and from BKT+F to BKT+FA.
Finally, the different variants of BKT achieve a maximum
gain of 15% over standard BKT. Notably, the best version of
BKT, that is, BKT+FSA, and DKT perform equally.


4. DISCUSSION AND FUTURE WORK 

Our aim in this study was to explore the performance of
standard BKT, all of its variants proposed in [6], PFA and
DKT on the funtoot dataset. The results we obtained are in
line with the results in [6]. When the AUC results were
computed by averaging over skills, DKT and BKT+FSA
performed equally well, while DKT outperformed standard BKT
with a gain of 15%. Also, BKT+S gave a performance
which was very close to that of DKT. Though DKT does perform
better when the AUC results are averaged over all data
points, the magnitude of the gain is small.

Figure 3: A comparison of PFA, DKT, Multi-skill DKT, BKT
and its variants


Similar results hold for PFA. PFA achieves a
high gain compared to all the models when AUC results
are averaged over skills. When AUC results are averaged
over all data points, PFA equals DKT’s performance and
outperforms the rest of the models, though not by a very
high margin. This is not consistent with the results in [13],
where DKT outperforms PFA, though not overwhelmingly.


The above results reinforce the hypothesis proposed in [6] 
that the domain of knowledge tracing seems to be shallow 
and may not require the depth that the deep neural net- 
works offer. The predictive or the explanatory power of a 
model can also be characterized in terms of the number of 
parameters the model fits. One of the reasons why DKT is 
expected to be more successful than other models, at the 
cost of interpretability, is that it has weights in the order 
of hundreds of thousands. Moreover, being made up of a 
layer of LSTM cells, DKT has the capability of looking back
an arbitrary number of timesteps. In contrast, variants
of BKT and PFA are very simple and interpretable mod- 
els. Their simplicity can easily be attributed to the small 
number of parameters they fit. 


Standard BKT needs four parameters: pInit (the probability
that the student is in the learned state before the first
practice), pLearn (the probability that the student transitions
from the not-learned state to the learned state at each
practice), pGuess (the probability that the student guesses the
answer while in the unlearned state) and pSlip (the
probability that the student accidentally makes a mistake while
in the learned state). PFA is even more compact: only
three parameters are learned per skill, namely the item
difficulty and one coefficient each for prior failures and successes.
With this, the total number of parameters for a few hundred
skills (which is true in our case) would be a few hundred:
three × number of skills. Hence, in our context, it seems
appropriate to say that a few hundred parameters are better
than a few hundred thousand parameters.
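
To make the role of the four BKT parameters concrete, the following sketch shows the standard textbook BKT prediction and update step for a single skill. It is an illustration only, with hypothetical parameter values, and does not reproduce the implementation or any of the extensions evaluated in this paper.

```python
def bkt_predict(p_known, p_guess, p_slip):
    """Probability of a correct response given the current knowledge estimate."""
    return p_known * (1.0 - p_slip) + (1.0 - p_known) * p_guess


def bkt_update(p_known, correct, p_learn, p_guess, p_slip):
    """Update the probability that the skill is learned after one response."""
    if correct:
        evidence = p_known * (1.0 - p_slip)
        posterior = evidence / (evidence + (1.0 - p_known) * p_guess)
    else:
        evidence = p_known * p_slip
        posterior = evidence / (evidence + (1.0 - p_known) * (1.0 - p_guess))
    # The student may also transition to the learned state at this practice.
    return posterior + (1.0 - posterior) * p_learn


# Hypothetical parameter values chosen only for illustration.
p_init, p_learn, p_guess, p_slip = 0.5, 0.1, 0.2, 0.1
p_correct = bkt_predict(p_init, p_guess, p_slip)
after_success = bkt_update(p_init, True, p_learn, p_guess, p_slip)
after_failure = bkt_update(p_init, False, p_learn, p_guess, p_slip)
```

A correct response raises the knowledge estimate and an incorrect one lowers it, and this running state is what BKT carries from one practice opportunity to the next.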


Both BKT and DKT, in an abstract sense, are models
which maintain the knowledge state of the student. With
each response of the student, the knowledge states are
updated and those states are used to generate future
predictions. They both require the time series data of the student’s
responses. This is significantly different from the type of
data required by PFA. PFA operates on abstract features
of the student’s interactions, such as the total number of prior
successes and failures. It occurs to us that the abstract features are
smoother than the time series data of responses. It seems
the domain of knowledge tracing can be deciphered better
if the abstract features are used instead of the detailed trail of
responses, which might be noisy. More studies and
experiments are required to validate this point.


The skills used in our experiment are the LGs from the
funtoot Knowledge Graph, which are tagged at the level of
subsubtopic, which acts as the context of an LG. Also, an LG can
occur in multiple subsubtopics. The number of skills discovered in
our experiments with BKT+S and BKT+FSA was in the range
of 145 to 175, which is close to the number of subsubtopics
(119) in our dataset. We suspect that there is some relation
between the subsubtopics in our dataset and the skills
discovered. We would like to investigate this further in the future.
DKT also supports skill discovery as proposed in [9], which
we would look into in the future to compare the skills discovered
by several algorithms.


The funtoot dataset has items with multiple skills, which forced
us to extend DKT and come up with Multi-skill DKT. This
variant of DKT underperformed marginally compared to
DKT. We do not have a clear understanding of why
this is so, and hence this also requires further study. Since
we have used a very crude dataset, that is, one that does not contain
features about attempts, time durations, hints, item context,
etc., it would be interesting to use them with DKT and see
if the depth of DKT can exploit them.
ABSTRACT


Children are inherently curious and rapidly learn a number of things 
from the physical environments they live in, including rich vocabu- 
lary. An effective way of building vocabulary is for the child to ac- 
tually interact with physical objects in their surroundings and learn 
in their context [17]. Enabling effective learning from the physical 
world with digital technologies is, however, challenging. Specifi- 
cally, a critical technology component for physical-digital interac- 
tion is visual recognition. The recognition accuracy provided by 
state-of-the-art computer vision services is not sufficient for use in 
Early Childhood Learning (ECL); without high (near 100%) recog- 
nition accuracy of objects in context, learners may be presented 
with wrongly contextualized content and concepts, thereby making 
the learning solutions ineffective and un-adoptable. In this paper, 
we present a holistic visual recognition system for ECL physical- 
digital interaction that improves recognition accuracy levels using 
(a) domain restriction, (b) multi-modal fusion of contextual infor- 
mation, and (c) semi-automated feedback with different gaming 
scenarios for right object-tag identification & classifier re-training. 
We evaluate the system with a group of 12 children in the age group 
of 3-5 years and show how these new systems can combine existing 
APIs and techniques in interesting ways to greatly improve accura- 
cies, and hence make such new learning experiences possible. 


1. INTRODUCTION


Children learn a lot from the physical environment they live in. One 
of the important aspects of early childhood learning is vocabulary 
building, which happens to a substantial extent in the physical en- 
vironment they grow up in [17]. Studies have shown that failure 
to develop sufficient vocabulary at an early age affects a child’s 
reading comprehension and hence their ability to understand other 
important concepts that may define their academic success in the 
future. It is also evident from a study that failure to expose a child
to a sufficient number of words by the age of three years leads to a 30
million word gap between kids who have been exposed to a lot of
quality conversations and those who have not been exposed
as much [13].


Vocabulary building has been a theme for early childhood learn- 
ing and is closely associated with its context in the physical world. 


*These authors contributed equally to this work


The exploration of the child’s physical surroundings triggers new
vocabulary and vice versa. Relating physical-world objects and
concepts to digital-world content requires a seamless flow of
information. Increasingly, the availability of cheap sensors such as
cameras and microphones on connected devices enables the capture of
physical-world information and context and their translation into
personalized digital learning.


An envisioned system uses mobile devices to take pictures of the
child’s physical surroundings and make the best sense out of the
picture. This is then translated to a learning session where the child
is taught about the object in focus, its relation to other objects,
its pronunciation, its multiple representations, etc. Recognition
of pictures for teaching a child requires high recognition accuracy.
In-the-wild image recognition accuracies are in general low,
especially for images taken with mobile devices. Moreover, pictures
taken by a child are even more challenging given the shake, blur,
lighting issues, pose, etc. that come with them.


To this end, in this paper, we take a holistic approach of recognition- 
in-context using a combination of (a) domain restriction, (b) multi- 
modal fusion of contextual information, and (c) gamified disam- 
biguation and classifier re-training using child-in-the-loop. Specifi- 
cally, we use object recognition results from a custom-trained (with 
images from restricted domains) vision classifier, and combine them 
with information from the domain knowledge that is available when- 
ever a new domain of words is taught to a child in the classroom or 
at home. We use a new voting based multimodal classifier fusion 
algorithm to disambiguate the results of vision classifier, with re- 
sults from multiple NLP classifiers, for better accuracy. We show 
that using such a framework, we can attain levels of accuracy that 
can make a large majority of the physical-digital interaction expe- 
riences fruitful to the child, and also get useful feedback from the 
child at a low cognitive load to enable the system to retrain the clas- 
sifier and improve accuracy. We tested our system with a group of 
12 children in the age group of 3-5 years and show that children can 
play an image disambiguation game (that allows the child to verify 
what class label has actually been identified by the system) very 
easily with graceful degradation of performance on difficult im- 
ages. In most cases, multi-modal context disambiguation improves 
object recognition accuracy significantly, and hence the human dis- 
ambiguation step remains limited to one or two rounds, which en- 
sures the child’s continuing interest in the games and learning ac- 
tivities. The system learns from the child feedback, and the child 
in turn feels engaged to enable the system to learn over time. The 
nuggets of information made available about the object in focus at 
the end of playing a game were also found to be very engaging by 
the child. 


In summary, this paper makes the following contributions: 


• We take a holistic approach to address the challenges with
automatic visual recognition for physical-digital interaction
to enable early childhood learning in context. Our three-stage
approach includes (a) domain restriction, (b) contextual
disambiguation and (c) gamified human disambiguation,
which enables a platform for building a variety of early
childhood learning applications with physical-digital interaction.


• We propose a novel re-ranking algorithm that uses the notion
of strong vouching to re-order the output labels of a vision
classifier based on strong supporting evidence provided
by the additional context from semantic representation models
in NLP, namely GloVe [6], Word2vec [11] and ConceptNet [5]
(which can be textual cues in the form of classroom
and curriculum context, domain focus, conversational input
and clues, etc.). Note that we use the terms "re-order" and
"re-rank" interchangeably throughout this paper.


• We evaluate a simple disambiguation game for children to
choose the right label from the Top-K labels given out by
the system. Through a usability study with 12 children, we
make the case that engaging user experiences can indeed be
developed to bridge the gap between automatic visual
recognition accuracies and the requirement of high accuracy for
meaningful learning activities.


2. MOTIVATION AND RELATED WORK 


Early childhood learning applications with physical-digital
interaction fall into two categories: (i) Application-initiated activities: In
this category, the child is given a context by the application and is
required to find a relevant physical object and take a picture [2]. For
example, the application may prompt the child to take a picture of
"something that we sit on", "a fruit", "something that can be used
to cut paper", etc. (ii) Child-initiated activities: In this category,
the child takes a picture of an object and intends to know what it is,
where it comes from, other examples of the same type of objects,
etc. For example, the child may take a picture of a new gadget or
machine found in school, a plant or a leaf or a flower, etc. and
want to know more about them.


In each of these categories, the application is required to identify 
what the object is with Top-1 accuracy (i.e. a vision recognition so- 
lution should emit the right label at the top with high confidence). 
While a lot of advancement has been made in the improvement of 
accuracy of vision classifiers, Top-1 accuracy levels are still rela- 
tively low, although Top-5 accuracy levels (i.e. the right label is 
one of the top 5 labels emitted) are more reasonable. Nevertheless, 
the goal is to be able to work with the Top-5 list, and using the 
techniques described earlier, push the Top-1 accuracy to acceptable 
levels for a better interaction. 
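
For clarity, the Top-K accuracy measure used throughout can be computed as in the sketch below. The function name and the sample labels are hypothetical and serve only to illustrate the definition given above.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of images whose true label is among the first k emitted labels.

    ranked_predictions holds, for each image, the labels sorted by
    descending classifier confidence.
    """
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Two hypothetical images with their ranked label lists.
predictions = [["apple", "orange", "pear"], ["cat", "dog", "fox"]]
truth = ["orange", "cat"]
top1 = top_k_accuracy(predictions, truth, 1)
top2 = top_k_accuracy(predictions, truth, 2)
```

Re-ordering a list that already contains the right label can raise Top-1 accuracy without changing Top-5 accuracy, which is the effect targeted by the re-ranking stage.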


2.1 Vision Recognition Accuracy 

To understand the efficacy of state-of-the-art solutions
quantitatively, we experimented with two deep convolutional neural networks
(Baseline Model 1: VGGNet [18] and Baseline Model 2: Inception
V3 [19]). Inception V3 has been found to have a 21.2% Top-1 error
rate on the ILSVRC 2012 classification challenge validation set [8].
Even in experiments where the baseline models were custom trained
with 300 training images per class and tested with images taken
from an iPad, we observed low Top-1 accuracy (72.6% in Baseline
Model 1 and 79.1% in Baseline Model 2); i.e. about one in four
images will be wrongly labeled. Even the Top-5 accuracy is only 88.05% in
Baseline Model 1 and 89.3% in Baseline Model 2. We also trained
the baseline models with the complete Imagenet [8] images for the
considered classes and observed <1% improvement. Further,
when multiple objects are present in the image frame, the Top-1
accuracy degrades further (38.2% in Baseline Model 1 and 44.5%
in Baseline Model 2 for 2 objects in a frame), and so does the Top-5
accuracy (77.9% in Baseline Model 1 and 85.6% in Baseline
Model 2). Note that this could be a common scenario with children
taking pictures, in which multiple objects get captured in a single
image frame. Observe that recent Augmented Reality (AR)
applications such as Blippar [4], Layer [9] and Aurasma [3] rely on a similar
vision recognition task, and hence run into similar inaccuracies in
uncontrolled settings. While adult users of such applications may
be tolerant of inaccuracies of the application, children may get
disengaged when the system detects something wrongly or is unable
to detect anything at all.


2.2 Multi-modal Information Fusion

Using additional information to identify the objects holds promise
in improving the accuracy of vision recognition. For instance,
several past works ([22], [14], [15]) improve the image classification
output based on the text features derived from the image.
Specifically, the authors in [20] propose techniques that train the model
specifically with images that contain text, for efficient extraction of
text and image features from the image. They also propose fusion
techniques to merge these features for improving image
recognition accuracies. While this may be possible in some scenarios, the
application’s accuracy will remain a challenge when such textual
information embedded in the image is not present. Several works
in the literature propose indexing of images based on text annotations
for efficient image search. [12] surveys and consolidates various
approaches related to efficient image retrieval systems based on text
annotations. Likewise, [21] proposes techniques to label images
based on image similarity concepts. These works are
complementary, and do not address the problem of correctly determining the
labels right when a picture is taken based on a context.


In summary, the early childhood learning scenarios require a
holistic solution that leverages the state-of-the-art vision recognition
solutions, but goes beyond them in improving the detection accuracy of the
image captured to make engaging applications for children. We
describe one such holistic solution next.


3. PROPOSED APPROACH 


Our goal is to enable a holistic solution for applications to provide
as input an image taken by a child, and emit as output the final label
that should be used as an index into the relevant learning content. A
high-level overview of our solution is depicted in Figure 1. In one
of the envisioned applications built for physical-digital interaction,
a child takes a picture that is sent as input to the proposed ECL
Image Recognition (ECL-IR) Module, which emits the correct label of
the image by applying the following three stages: (i) Stage 1:
Domain Specific Customized Training (which improves Top-K
accuracy), (ii) Stage 2: Domain Knowledge (DK) based disambiguation
and reordering (which improves Top-1 accuracy) and (iii) Stage 3:
Human Disambiguation game (confirmation step). We now discuss
each of these stages in detail.


3.1 Stage 1: Domain Specific Customized Training of Baseline Models

The first stage of our solution strives to improve the Top-K accu- 
racy of the vision classifiers by constraining the domain of child 
learning in which they are applied. In order to achieve this, we per- 
form custom training of the baseline models with domain-specific 
data sets. This step is very commonly applied in most of the vi- 
sion recognition use-cases for improving the Top-K accuracy and 
several reported statistics indicate good Top-K accuracy improve- 
ments through custom training. For example current state-of-art 
vision classifier [19] reports 94.6% Top-5 accuracy on ILSVRC 
2012 classification challenge validation set. However, even this 
state-of-art vision classifier reports 21.2% Top-1 error rate on the 
same validation set. In the next section, we discuss how ECL-IR 
module improves Top-1 accuracy through contextualized reorder- 
ing (Stage 2). 
Figure 1: High Level Solution Overview


3.2 Stage 2: Domain Knowledge based Disambiguation and Reordering

In this section, we propose to improve the Top-1 accuracy through
intelligent reordering of the Top-5 labels from the vision classifier.
In order to achieve this, we leverage the domain knowledge
associated with the teaching activity as a second source of information
to re-order the Top-5 output labels. Domain Knowledge refers to
the classroom learning context (derived from the teacher’s current
syllabus, teaching themes, object-related clues, collaborative clues)
based on which the learning activity is conducted. Note that the
Domain Knowledge could be a word or a phrase too. We now
discuss various important aspects of this stage in detail.


Enabling Semantic Capability. Domain Knowledge is a text
representation of the intent or activity derived from the classroom
context. However, the same intent or information could be conveyed
through different keywords, and hence traditional bag-of-words
approaches [23] will not solve the problem in our use-cases. We
leverage semantic representations (i.e. distributed word
representations [16]) of words to enable a keyword-independent
re-ranking algorithm. In a distributed word representation, words
are represented as N-dimensional vectors such that the distance
between them captures semantic information. There are various
pre-trained semantic representation models (also called word
embedding models), such as Word2Vec [11] and GloVe [6], which
enable semantic comparison of words. Likewise, there is also
ConceptNet [5], which is a multilingual knowledge base representing
words and phrases that people use and the common-sense
relationships between them. This paper leverages these existing works to
achieve an effective re-ranking of the output label-set with semantic
capability.


Existing Approach Results. One naive way to approach the
problem of re-ranking is to find the DK Correlation Score (DK-CS)
using Algorithm 1 and re-rank the Top-5 labels in descending order
of their DK-CS. However, this approach has a strong bias towards the
semantic representation output and completely ignores the ranking
that is produced by the vision classifier.


Other fusion approaches that have been tried combine one or
more of the classifier outputs, (i) Word2Vec (S1), (ii) GloVe (S2),
(iii) ConceptNet (S3) and (iv) Vision (S4), in different ways. The most
common are the product rule and the weighted average rule, where
the confidence scores are combined by computing either their
product or their weighted sum. The improvement in Top-1
accuracy of such combinations varies from -11% to 6%. We observe
that the Top-1 accuracy of the system did not increase significantly,
and in many cases the Top-1 accuracy of the system dropped after
re-ranking as compared to the original list. The reason is the need
for proper and more efficient resolution of conflicts between
DK-CS wins and vision confidence score wins. In the next section, we
explain the proposed novel re-ranking algorithm, which greatly
improves the Top-1 accuracy of the system by effectively resolving
the conflicts between DK-CS and vision rankings.


Algorithm 1: Algorithm to calculate DK Correlation Score

Input: Label, Domain Knowledge text (DK)
Output: DK Correlation Score (i.e. semantic correlation between
DK and Label)

For every word in DK, fetch its corresponding N-dimensional
semantic vector from the semantic representation model.

Representation(DK) ← Compose an N-dimensional vector for the
complete DK by combining word-level vectors into a phrase-level
vector using the linear average technique

Representation(Label) ← Fetch the N-dimensional vector for the label
from the semantic representation model

DK Correlation Score = Cosine Distance between
Representation(DK) and Representation(Label)

return DK Correlation Score
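
The DK Correlation Score of Algorithm 1 can be sketched in a few lines. The two-dimensional embeddings below are toy values invented for this example; a real system would look the vectors up in a pre-trained model such as GloVe or Word2Vec. The sketch also assumes that a higher score means stronger correlation, so it uses cosine similarity between the two representations.

```python
import math


def average_vector(words, embeddings):
    """Phrase-level vector obtained by linear averaging of word-level vectors."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("no word of the domain knowledge text has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dk_correlation_score(label, dk_text, embeddings):
    """Semantic correlation between a label and the domain knowledge text."""
    dk_vector = average_vector(dk_text.lower().split(), embeddings)
    return cosine_similarity(dk_vector, embeddings[label])


# Toy embeddings (assumed values, for illustration only).
toy = {
    "red": [1.0, 0.0],
    "fruits": [0.0, 1.0],
    "apple": [0.9, 0.8],
    "orange": [0.1, 0.9],
}
apple_score = dk_correlation_score("apple", "red fruits", toy)
orange_score = dk_correlation_score("orange", "red fruits", toy)
```

With these toy vectors, "apple" scores higher than "orange" for the domain knowledge "red fruits", which is the kind of evidence the re-ranking step relies on.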


Proposed Re-Ranking Approach. In our proposed approach,
we fuse the inferences from various semantic models and the vision
model using the Majority-Win Strong Vouching algorithm for re-ordering
the Top-5 output list. There are two important aspects of this
approach: (i) Strong Vouching of Semantic Models and (ii) Majority
Voting across Semantic Models.


Strong Vouching of Semantic Models: As discussed earlier, the
reason for the failure of the traditional fusion approaches is the need for
efficient resolution of conflicts between the semantic model ranks
and the vision model ranks. Let us understand this problem through
two example scenarios. (i) Scenario 1: the Top-1 prediction is "orange",
the Top-2 prediction is "apple" and the domain knowledge is "fruits";
(ii) Scenario 2: the Top-1 prediction is "orange", the Top-2 prediction
is "apple" and the domain knowledge is "red fruits". In the first
scenario, since the domain knowledge is semantically correlated with both
the Top-1 and Top-2 predicted labels, the system should maintain the same
order as predicted by the vision model. However, in the second
scenario, since the domain knowledge (i.e. "red fruits") is more highly
correlated with Top-2 (i.e. "apple") than with Top-1 (i.e. "orange"),
the system should swap the order of the Top-1 and Top-2 labels.
It turns out that just having a higher DK-CS is not enough to swap
the labels. We show that the DK-CS of one label (label-1) should
exceed that of the other label (label-2) by a specific threshold value to
indicate that label-1 is semantically more correlated with the domain
knowledge than label-2, and hence effect a swap against the vision rank.
Through empirical analysis in Section 4.2, we show that, in the
context of reordering Top-K labels, if the normalized DK-CS of a label
is greater than that of the other label by a value equal to 1/K (the
threshold value), then the former label is more semantically correlated with
the domain knowledge than the latter.


Majority Voting across Semantic Models: As mentioned before,
many semantic models exist in the literature and each of them is
trained on different data-sets. Therefore, the strong vouching
behavior of all these semantic models is not necessarily the same.
In order to resolve this, our approach considers multiple semantic
models together (such as GloVe, Word2Vec and ConceptNet) and
enables swapping of the i-th label with the j-th label (i<j) in the Top-K
output list only when a majority of the semantic models strongly
vouch that the j-th label is more correlated with the DK than the
i-th label. This makes the system more intelligent in resolving conflicts
across semantic models as well as conflicts between DK correlation
score wins and vision confidence score wins. Algorithm 2
explains the overall flow of the proposed re-ranking algorithm.


Algorithm 2: Fusion based on Majority Win Strong Vouching Concept

Input: Top-K output labels from image recognition model,
Domain Knowledge (DK)
Output: Reordered Top-K output label list

Sort Top-K labels based on vision confidence score
Re-rank the Top-K labels by sorting using the following compare
logic
Compare logic (i-th label, j-th label, DK): begin
    [Note] i-th label precedes j-th label in the ranked Top-K list.
    X1 = Total number of semantic models strongly vouching for
    j-th label as compared to i-th label
    X2 = Total number of semantic models strongly vouching for
    i-th label as compared to j-th label
    if X1 > X2 then
        swap i-th label and j-th label in the ranked Top-K list
    else
        maintain the same order of i-th label and j-th label

return Re-Ranked Top-K List
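
One possible reading of Algorithm 2 is sketched below. Several details are assumptions made for this illustration: DK-CS values are normalized by dividing by their sum over the Top-K labels, a model strongly vouches for a label when its normalized score exceeds the other label's by at least 1/K, and the compare logic is applied through repeated adjacent-pair passes. The scores used are invented.

```python
def normalize(scores):
    """Scale the DK-CS values of one semantic model so that they sum to 1."""
    total = sum(scores.values())
    if not total:
        return dict(scores)
    return {label: s / total for label, s in scores.items()}


def strongly_vouches(norm_scores, favored, other, threshold):
    """True if the model prefers `favored` over `other` by at least the threshold."""
    return norm_scores[favored] - norm_scores[other] >= threshold


def rerank(vision_ranked, model_scores):
    """Re-rank labels sorted by vision confidence using majority-win strong vouching.

    model_scores is a list with one dict per semantic model, label -> DK-CS.
    """
    k = len(vision_ranked)
    threshold = 1.0 / k
    normalized = [normalize(s) for s in model_scores]
    ranked = list(vision_ranked)
    for _ in range(k):  # bounded number of adjacent-pair passes
        swapped = False
        for pos in range(k - 1):
            first, second = ranked[pos], ranked[pos + 1]
            x1 = sum(strongly_vouches(n, second, first, threshold) for n in normalized)
            x2 = sum(strongly_vouches(n, first, second, threshold) for n in normalized)
            if x1 > x2:
                ranked[pos], ranked[pos + 1] = second, first
                swapped = True
        if not swapped:
            break
    return ranked


# Invented DK-CS values from three semantic models for "red fruits".
red_fruits = [
    {"orange": 0.2, "apple": 0.8},
    {"orange": 0.1, "apple": 0.9},
    {"orange": 0.5, "apple": 0.5},
]
# Invented DK-CS values for "fruits": no model strongly prefers either label.
fruits = [
    {"orange": 0.5, "apple": 0.5},
    {"orange": 0.55, "apple": 0.45},
    {"orange": 0.45, "apple": 0.55},
]
swapped_order = rerank(["orange", "apple"], red_fruits)
kept_order = rerank(["orange", "apple"], fruits)
```

In the first case two of the three models strongly vouch for "apple", so it moves to the top; in the second case no model vouches strongly and the vision order is kept, mirroring Scenario 2 and Scenario 1 above.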


3.3. Stage 3: Human Disambiguation Game 

It is important to note that, due to the limitations of existing
state-of-the-art vision models, accuracy never reaches 100%: even after
effective custom training and DK-based Top-K re-ranking, the accuracy
of the system is not 100% (though large improvements are observed).
So, there has to be a confirmation step involving a human in the loop
to confirm whether the predicted label is the right label, to prevent
teaching wrong objectives. Since we are dealing with kids, this step
has to be extremely light, simple and also engaging, so that they
do not feel any extra cognitive load. In this section, we propose
a simple disambiguation game which is designed in such a way that
(i) kids can easily play it correctly and (ii) kids’ interaction with the
game is greatly reduced when the Top-1 accuracy of the system is high.
Through the enhancements explained in previous sections, we make
the vision model reach high Top-1 accuracy, which in turn reduces
the kids’ interactions in the disambiguation game, thereby reducing
the overall cognitive load on the kids.


Our system leverages image matching for the disambiguation game. 
The re-ranked Top-K list (the output of Stage 2) is fed as input to 
the game. The game, depicted in Figure 2, renders reference images 
of each label (with possible variants of the same object) one by one 
in the order of the re-ranked list and asks the kid to select the 
image if it looks similar to the object photographed with the 
camera. If not, the system shows the next reference image and 
continues until all K labels are rendered. Since the input to the 
game is a re-ranked Top-K list (which has high Top-1 accuracy), the 
kid has a high chance of encountering the right image in the first 
or second step, which spares the kid from traversing the list to the 
end. Usability guidelines [10] [1] for child-oriented apps suggest 
large, well-separated on-screen elements that kids can interact with 
easily. So, based on the display size of the form factor, the system 
can configure the number of images rendered in one step/cycle. 
Through a usability study with 15 kids, we show that kids are able 
to easily play image-similarity-based disambiguation games. When the 
right label is not among the predicted Top-K labels, the system 
executes the exit scenarios as configured. Possible exit scenarios 
include: (i) continuing the game with other labels in the learning 
vocabulary set in the sorted order of DK, and (ii) requesting 
teacher intervention. 
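The game loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `run_disambiguation`, the `is_match` callback (standing in for the kid's tap on a reference image), and the `images_per_step` parameter are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def run_disambiguation(
    reranked_labels: Sequence[str],
    is_match: Callable[[str], bool],
    images_per_step: int = 1,
) -> Tuple[Optional[str], int]:
    """Walk the re-ranked Top-K list until the kid confirms a label.

    Returns (confirmed_label, steps_shown). A label of None means the
    right object was not in the Top-K list, so the caller would run one
    of the configured exit scenarios (hypothetical hook, not shown).
    """
    if images_per_step < 1:
        raise ValueError("images_per_step must be at least 1")
    steps = 0
    for start in range(0, len(reranked_labels), images_per_step):
        steps += 1  # one screen of reference images is shown
        for label in reranked_labels[start:start + images_per_step]:
            if is_match(label):
                return label, steps
    return None, steps


# Illustrative call: the true object is the Top-1 label, so one step suffices.
top_k = ["apple", "pear", "ball"]
print(run_disambiguation(top_k, lambda label: label == "apple"))
```

The number of steps returned is a rough proxy for the interaction cost: the higher the Top-1 accuracy of the re-ranked list, the more often the loop ends after the first screen.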


[Figure: labels rendered, with voiceover, in the order followed in the re-ranked Top-K list] 
Figure 2: Basic Disambiguation Game 


4. EVALUATION 


We present here the experimental setup and the improvement in vision 
classifier results achieved by the re-ordering approach. We then 
present the empirical analysis used to determine the threshold for 
strong vouching by the semantic models. To show that our approach is 
independent of the domain knowledge, test set, training class set, 
and baseline image classification model (generality of the 
approach), we performed various experiments, as explained in the 
following subsections. Later in this section, we present the 
usability study, and the inferences drawn from it, conducted with a 
group of 12 children aged 3-5 years. 


Datasets: The training dataset includes images from ImageNet [8]. 
We used 52 classes and approximately 400 images per class for 
training. These 52 classes are objects commonly used in early 
childhood learning, such as apple, car, book, and violin. The test 
datasets include real images taken with mobile phones and tablets. 
Test dataset I includes 1K images in which a single object (from the 
training set) is present in the image frame. Test dataset II 
includes 2.6K images in which two objects (from the training set) 
are present in the image frame. All experiments were performed using 
two baseline image classification models: (i) Baseline Model 1 
(BM1), based on the VGGNet architecture [18], and (ii) Baseline 
Model 2 (BM2), based on the Inception-V3 architecture [19]. 


Domain Knowledge: In all experiments, we used two different sources 
of domain knowledge (DK): Domain Knowledge 1 (DK1), which is the 
Google dictionary definition [7] of each object class; and Domain 
Knowledge 2 (DK2), which is the merged description of each object 
class collected from three different annotators (crowd-sourced 
approach). In this way, we make sure that the domain knowledge is 
not keyword dependent and that re-ordering happens at the semantic 
level rather than at a specific keyword-matching level. 


Evaluation Metrics: In order to illustrate the performance of the 
proposed approach, evaluation parameters such as Top-1 accuracy, 
Top-5 accuracy, and improvements in Top-1 accuracy are used. The 
Top-1 accuracy is computed as the proportion of images such that 
the ground-truth label is the Top-1 predicted label. Similarly, the 
Top-5 accuracy is computed as the proportion of images such that 
the ground truth label is one of the Top-5 predicted labels. 
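The two metrics defined above can be computed with a few lines of code. The sketch below is illustrative only; the function name `top_k_accuracy` and the toy predictions are assumptions, not part of the original evaluation code.

```python
from typing import Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   ground_truth: Sequence[str], k: int) -> float:
    """Proportion of images whose true label is among the first k predictions."""
    if len(ranked_predictions) != len(ground_truth):
        raise ValueError("one ranked list is needed per ground-truth label")
    if not ground_truth:
        raise ValueError("at least one image is required")
    hits = sum(1 for ranked, truth in zip(ranked_predictions, ground_truth)
               if truth in ranked[:k])
    return hits / len(ground_truth)


# Toy data: four images, each with a ranked list of three predicted labels.
predictions = [
    ["apple", "ball", "car"],
    ["car", "book", "apple"],
    ["ball", "violin", "book"],
    ["book", "car", "ball"],
]
truth = ["apple", "book", "car", "ball"]
print(top_k_accuracy(predictions, truth, k=1))  # Top-1 accuracy
print(top_k_accuracy(predictions, truth, k=3))  # Top-3 accuracy
```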


4.1 Experimental Results 

The cumulative accuracy distribution of Baseline Model 1 (BM1) 
and Baseline Model 2 (BM2) on test datasets I and II is shown in 
Figure 3. Figures 3(a) and 3(b) show the improvement in Top-1 
accuracy after re-ordering on dataset I, which has one object in 
each image frame. As shown in Figure 3, for BM1, without re-ordering 
only 35% of object classes have Top-1 accuracy more than 90%, 
whereas with re-ordering using DK1 or DK2 around 55% of classes 

Proceedings of the 10th International Conference on Educational Data Mining A57 


[Figure 3, panels (a)-(d): Top-1 accuracy without DK, Top-5 accuracy without DK, Top-1 accuracy with DK1, and Top-1 accuracy with DK2, plotted against classes (%) for BM1 and BM2] 

Figure 3: Cumulative accuracy distribution of Baseline Model 1 & 
Baseline Model 2 on test datasets I (a-b) and II (c-d) 


have more than 90% Top-1 accuracy. Similarly, for BM2 our ap- 
proach shows 20% improvement in number of classes for 90% or 
above Top-1 accuracy on dataset I as shown in Figure 3(b). 

When a child takes an image, it is common for multiple objects to be 
captured in it. If more than one object is present in an image, the 
classifier becomes far more confused, which leads to low Top-1 
accuracy. Figures 3(c) and 3(d) show the improvement in Top-1 
accuracy on dataset II, where two objects (from the training set) 
are present in each image frame. As shown in Figure 3(c), for BM1, 
without re-ordering only 7% of object classes have Top-1 accuracy 
more than 90%, whereas with re-ordering using DK1 or DK2 around 40% 
of classes have more than 90% Top-1 accuracy. Similarly, for BM2, 
our approach shows an improvement of 45% in the number of classes 
with 90% or more Top-1 accuracy on dataset II, as shown in 
Figure 3(d). 
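The quantity reported in these comparisons, the percentage of classes whose Top-1 accuracy exceeds 90%, can be derived from per-image predictions as sketched below. The helper names and the toy data are hypothetical; the sketch only illustrates the bookkeeping, not the authors' evaluation scripts.

```python
from collections import defaultdict
from typing import Dict, Sequence


def per_class_top1(predicted: Sequence[str], truth: Sequence[str]) -> Dict[str, float]:
    """Top-1 accuracy computed separately for each ground-truth class."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for pred, true_label in zip(predicted, truth):
        totals[true_label] += 1
        if pred == true_label:
            hits[true_label] += 1
    return {label: hits[label] / totals[label] for label in totals}


def share_of_classes_above(per_class: Dict[str, float], threshold: float) -> float:
    """Percentage of classes whose accuracy is strictly above `threshold`."""
    if not per_class:
        raise ValueError("at least one class is required")
    above = sum(1 for acc in per_class.values() if acc > threshold)
    return 100.0 * above / len(per_class)


# Toy data: "apple" is always right, "car" is right half of the time.
truth = ["apple", "apple", "car", "car"]
predicted = ["apple", "apple", "car", "book"]
print(share_of_classes_above(per_class_top1(predicted, truth), 0.9))
```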


4.2 Empirical analysis to determine threshold for strong vouching of semantic models 

In this section, we explain the empirical analysis that determines 
the threshold value required by semantic models for strong vouching, 
as discussed in Section 3.2. When two elements are compared with 
respect to their semantic correlation with the domain knowledge 
(i.e. DK-CS), the threshold is the minimum amount by which the DK-CS 
of one element must exceed that of the other before we can 
confidently say that the first element is semantically more 
correlated with the domain knowledge. The choice of threshold is 
crucial for the proposed approach: it should be high enough to avoid 
wrong swapping of labels, yet low enough to allow the correct swaps 
that improve Top-1 accuracy. 

For the empirical analysis of the threshold value, we conducted 
experiments on dataset II with the following combinations: (i) four 
different domain knowledges collected through crowd sourcing, (ii) 
four different threshold values, and (iii) both baseline models 
(BM1 & BM2), to make the analysis independent of any local data 
behavior. The results are shown in Figure 4. From the results, we 
noticed that the correct threshold value is 0.2 for reordering the 
Top-5 predicted labels: as observed in Figure 4, Top-1 accuracy 
peaks when the threshold value is 0.2. We now discuss the reason 
behind this value. 

Figure 4: Improvement in Top-1 accuracy while reordering predicted 
Top-5 labels for different domain knowledge, threshold values and 
baseline models 


In our approach, we use normalized DK-CS, which means that if we 
assume an equal distribution over labels while reordering the Top-5 
predicted labels, the DK-CS for each label is 0.2 (i.e. 1/5). We 
propose that, if the DK-CS of one label exceeds the DK-CS of another 
label by a value near or equal to 1/k (i.e. the individual DK-CS of 
each label under an equal distribution), then this is considered 
strong vouching by the semantic model for the former label. 

In order to confirm the above claim, we performed experiments to 
reorder the Top-4 predicted labels (results are shown in Figure 5). 
From the results, we can see that performance peaks for thresholds 
between 0.2 and 0.3, which is near 0.25 (1/k where k is 4). There is 
a very noticeable degradation in performance when the threshold is 
below 0.2 or above 0.3. Similar trends were also observed when 
experimenting with Top-3 re-ordering. 

Figure 5: Improvement in Top-1 accuracy while reordering predicted 
Top-4 labels for different Domain Knowledge, Threshold Values and 
Baseline Models 

Therefore, the correct choice of threshold while re-ordering Top-k 
predicted labels is 1/k. When the system is tuned to vouch strongly 
using this threshold value, we observe high improvements in Top-1 
accuracy. 
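The pairwise swap rule and the 1/k threshold can be put together in a short sketch. Several details here are assumptions made for illustration: the loop visits every pair (i, j) with i < j in order, each semantic model supplies one non-negative score per label that is normalized to sum to 1, and "strong vouching" means a score difference of at least the threshold. The function names are hypothetical.

```python
from typing import List, Optional, Sequence


def normalize(scores: Sequence[float]) -> List[float]:
    """Scale scores so they sum to 1 (normalized DK-CS, as assumed here)."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]


def rerank_top_k(labels: Sequence[str],
                 model_scores: Sequence[Sequence[float]],
                 threshold: Optional[float] = None) -> List[str]:
    """Re-rank a Top-K list using strong vouching by several semantic models.

    model_scores[m][p] is the normalized DK-CS that model m assigns to the
    label at original position p. The default threshold is 1/k.
    """
    k = len(labels)
    if threshold is None:
        threshold = 1.0 / k
    order = list(range(k))  # order[r] = original position of the label at rank r
    for i in range(k):
        for j in range(i + 1, k):
            a, b = order[i], order[j]
            # X1: models strongly vouching for the j-th label over the i-th.
            x1 = sum(1 for s in model_scores if s[b] - s[a] >= threshold)
            # X2: models strongly vouching for the i-th label over the j-th.
            x2 = sum(1 for s in model_scores if s[a] - s[b] >= threshold)
            if x1 > x2:
                order[i], order[j] = order[j], order[i]
    return [labels[p] for p in order]


labels = ["pear", "apple", "ball"]
scores = [normalize([1, 7, 2]), normalize([2, 6, 2])]
print(rerank_top_k(labels, scores))
```

With k = 3 the default threshold is 1/3; both toy models score "apple" well above "pear" by more than that margin, so the two labels are swapped, while the small difference between "pear" and "ball" leaves their order unchanged.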


4.3 Usability Study 


The main purpose of this usability study is to observe the following 
key points in children aged 3-5 years: (i) whether they can take 
images using the camera of a phone or tablet, (ii) whether they can 
visually compare the physical object that was photographed with the 
reference image provided by the classifier in the disambiguation 
game, and (iii) how the cognitive load on children compares when 
they see fewer vs. more images on the device screen during the game. 
To conduct this study, we asked each child to play with our app 
installed on iPads, which logged the complete click-stream data of 
the app for tracking various quantitative parameters. We also noted 
down the feedback from parents/observers during the activity. 


We conducted this usability study on 12 children with a total of 29 
trials. In each trial, a child was allowed to play with the app as long 


ABSTRACT 


Student learning strategies play a critical role in their overall 
success. The central goal of this study is to investigate how 
learning strategies are related to student success in an online 
adaptive mathematics tutoring system. To accomplish this goal, 
we developed a model to predict student performance based on 
their strategies in ALEKS, an online learning environment. We 
have identified student learning strategies and behaviors in seven 
main categories: help-seeking, multiple consecutive errors, 
learning from errors, switching to a new topic, topic mastery, 
reviewing previous mastered topics, and changes in behavior over 
time. The model, developed by using stepwise logistic regression, 
indicated that requesting two consecutive explanations, making 
consecutive errors and requesting an explanation, and changes in 
learning behaviors over time, were associated with lower success 
rates in the semester-end assessment. By contrast, reviewing 
previously mastered topics was a positive predictor of 
success in the last assessment. The results showed that the 
predictive model was able to predict students’ success with 
reasonably high accuracy. 


Keywords 


Help-seeking, errors, learning strategy, math, student success, 
adaptive tutoring system 


1. INTRODUCTION 


Computer-based learning environments, particularly intelligent 
tutoring systems (ITS), are becoming more commonly used to 
assist students in their acquisition of knowledge. Computer-based 
tutors provide tailored instruction and one-to-one tutoring, which 
can improve students’ learning experiences and their motivation. 
These learning systems also provide unique and critical insight to 
learning science researchers by creating exhaustive archives of 
student learning behaviors. A central goal of investigating student 
learning processes is to unveil the associations between learning 
behaviors and performance, ultimately allowing learning system 
developers and researchers to predict and understand student 
performance. This knowledge allows for evidence-based and 
individually tailored feedback to be provided to students who are 
struggling to learn. 


2. RELATED WORK 


Many studies have investigated the relationships between learning 
behaviors and success in learning [1, 2]. The most frequent 
learning behaviors used in the current literature involve help- 
seeking, making errors, persistence, and changes in learning 
behaviors over time [3, 4, 5]. For example, worked examples, an 
effective and commonly used type of help, can be overused by 
students, negatively affecting learning [6]. However, asking for 
help after making an error has been found to be an effective help- 
seeking strategy, particularly for high prior knowledge students 
[7]. Additionally, reading a worked example after solving a 
problem can foster better learning than practice alone and reading 
a worked example before solving a problem can improve learning 
when compared to reading a worked example after solving a 
problem [8, 9]. 


Clearly, there is a delicate interplay between help-seeking 
strategies students use, their prior knowledge, and learning 
success. Whether students benefit from making errors often 
depends on how errors are approached pedagogically. Errors, 
when treated as stemming from student inadequacies, can trigger 
math anxiety, which negatively affects students’ learning [10, 11]. 
An extreme example of making errors during learning is seen in 
wheel-spinning behaviors, in which students attempt ten problems 
or more without mastering the topic. While too many consecutive 
errors (i.e. wheel-spinning) undermine learning performance [12], 
repeated failure in the low-skill phase has been found to improve 
the likelihood of success in the next step [5] and to lead to more 
robust learning [13]. Furthermore, the errors that naturally occur 
from desirable difficulty are considered to be an essential element 
in learning [14] and facilitate long-term knowledge retention and 
transfer [15, 16, 17]. 


Many of the current computer-based tutoring systems are 
designed to provide students more autonomy by allowing them to 
learn at their own pace. In self-paced or self-regulated tutoring 
systems, students’ learning behaviors tend to change over time 
during learning. These changes in learning behaviors over time 
represent an important aspect of learning for researchers to 
understand. Relatively more well-structured behavior over time is 
positively related to reading performance, whereas more chaotic, 
less-structured learning behaviors are related to poor reading 
performance [4]. 


Persistence is another increasingly studied behavior in learning 
research. For example, persistence is measured as time spent on 
unsolved problems during solving anagrams and riddles [18]. 
Persistence on challenging tasks is associated with mastery goals, 
which benefit learning [19]. Given these definitions of 
persistence, a contrasting learning behavior could be considered 
frequently switching topics within a learning system to find easier 
topics, an example of gaming the system [20]. Based on students’ 
self-reports, persistence was also found to be positively related to 
student satisfaction with the computer-based tutoring system [21]. 
However, unproductive persistence (i.e. wheel-spinning) impedes 
learning, but various formats of problems and spaced practice can 
reduce unproductive persistence and improve learning [22]. 


Reviewing previously learned materials is an efficient way to 
improve learning. Per Ebbinghaus’ forgetting curve [23], memory 
retention declines over time. Repeated exposure to previously 
learned materials can enhance memory retention and improve 
learning [24]. An example of reviewing previously learned 
materials is retrieval practice, which was found to improve 
students’ memory retention of reading materials [25] as well as 
accuracy in solving “student-and-professor” algebra word 
problems [26]. 


This study aims to investigate which learning behaviors predict 
student success in ALEKS (Assessment and Learning Knowledge 
Spaces), a math tutoring system that adapts to students’ 
knowledge [27]. Given the above literature, help-seeking 
behaviors, multiple consecutive errors, learning from errors, 
temporal behavioral changes, persistence (i.e. switch to a new 
topic without mastering the current topic), and reviewing previous 
mastered topics were selected as potential predictors of success in 
ALEKS. In addition, the percentage of topics that have been 
mastered, an indicator of learning progress, is included in the 
model to predict success. 


3. Description of ALEKS 


ALEKS is a web-based artificially intelligent learning and 
assessment system [27]. Its artificial intelligence is based on a 
theoretical framework called Knowledge Space Theory (KST) 
[28]. KST allows domains to be represented as a knowledge map 
consisting of many knowledge states, which represent the 
prerequisite relationships between different knowledge states 
(KS). Therefore, KST allows for a precise description of a 
student’s current knowledge state, and what a student is ready to 
learn next. ALEKS can estimate a student’s initial KS by 
conducting a diagnostic assessment (based on a test) when the 
student first begins to interact with the system. ALEKS conducts 
assessments during students’ progress through the course to 
update their knowledge states and to decide what the student is 
ready to learn next. 


For each topic within ALEKS, a problem is randomly generated, 
with adjustments made to several parameters for each problem 
type. This results in an enormous set of unique problems. Students 


are required to provide solutions in the form of free-response 
answers, rather than by selecting an answer from multiple choices. 
Explanations in the form of worked examples can be requested by 
students at any time. When an explanation is requested, a worked 
example for the current problem is provided and a new problem is 
provided to the student. The interface of ALEKS is displayed in 
Figure 1. 


ALEKS is self-paced; students can choose topics to learn and can 
choose when they want to request help. All the topics that the 
student is most ready to learn (per the KST model) are displayed 
in his or her knowledge pie (Figure 2). The knowledge pie 
presents the student’s learning progress in each math subdomain 
as well. 


Research has shown ALEKS produces learning outcomes 
comparable with other effective tutoring systems for teaching 
Algebra [29]. Using ALEKS as an after-school program has also 
been observed to be as effective as interacting with expert 
teachers [32]. Students need less assistance during learning when 
using ALEKS than in traditional curricula [31]. Additionally, 
ALEKS has been found to reduce the math performance 
discrepancies between ethnicities in an after-school program [32]. 


Figure 1. The ALEKS interface 


Figure 2. The ALEKS knowledge pie 


4. Data 

The data used in this study was collected from 179 students within 
11 college classes that used ALEKS for developmental mathematics in 
Fall 2016. The data comprise information about students’ learning 
actions and assessment scores. These actions include “correct” (C), 
“wrong” (W), mastering a topic (S; three C’s in a row within a 
single topic), failing a topic (F; five W’s in a row within a single 
topic) and explanations (E; requesting an explanation). The data 
also contain students’ last assessment scores in ALEKS, which 
account for students’ performance in ALEKS. 


5. METHODS 

We employed stepwise logistic regression with backward elimination 
to predict students’ success in ALEKS, using a training-test split. 
More details of this process are described below. 


5.1 Student success 

Success in ALEKS is defined as students knowing 60% or more of the 
topics in their last assessment. Therefore, we adopted 60% in the 
semester-end assessment as a cut-off value for success. Students 
whose last assessment score was 60% or greater were grouped as 
“successful students”, whereas those with last assessment scores 
under 60% were grouped as “unsuccessful students”. The dataset was 
randomly split into two parts: 60% of students’ data were used to 
train the model (N=107), and 40% were used to test the model’s 
generalizability (N=72). Success was labeled as 1 and failure was 
labeled as 0 in the prediction model. 


5.2 The features to predict success 

The following behavior patterns were used to predict student 
success: (1) help-seeking, i.e., requesting an explanation after 
making an error (WE), and requesting two sequential explanations 
(EE); (2) multiple consecutive errors, i.e., making two sequential 
errors (WW), making an error again after an error and requesting 
an explanation (WEW), making an error again after an error and 
requesting two explanations (WEEW), and the overall percentage 
of failure labeled by ALEKS (PF); (3) learning from errors, i.e., 
providing a correct answer after making an error (WC), providing 
a correct answer after making an error and requesting an 
explanation (WEC), and providing a correct answer after making 
an error and requesting two explanations (WEEC); (4) switching 
to a new topic, i.e., switching to a new topic after making an error 
or requesting an explanation (PNew), and switching to a new 
topic because of failure on a topic (FNew); (5) topic mastery (PS), 
i.e. providing three correct responses in a row; (6) reviewing 
previously mastered topics (PReview); and finally, (7) changes in 
learning behaviors over time (measured using the entropy metric). 
The features of the first four aspects mentioned above were 
generated by using D’Mello’s likelihood metric [33] (Equation 1). 
The likelihood metric is used to compute the transition probability 
of an event to another event. In the case of multiple events, we 
calculate a proportion of each sequence out of the number of 
sequences of that length. For example, the probability of WEEW 
means the transition probability of WEE to W. In this case, WEE 
is represented as M_t and W is represented as M_{t+1} in the 
formula. When the value produced by the likelihood metric is higher 
than 0, it signifies that M_{t+1} occurs after M_t more frequently 
than the base rate of M_{t+1}. Otherwise, M_{t+1} occurs after M_t 
at a rate lower than or equal to the base rate of M_{t+1}. 

\[ L(M_t \rightarrow M_{t+1}) = \frac{\Pr(M_{t+1} \mid M_t) - \Pr(M_{t+1})}{1 - \Pr(M_{t+1})} \tag{1} \] 

Shannon entropy is used to compute the degree of regularity in the 
changes in students’ learning behaviors over time (specifically 
focusing on the shifts between making an error, giving a correct 
answer, and requesting an explanation) [34] (Equation 2). High 
entropy values represent disordered learning behavior patterns. On 
the contrary, low entropy implies an ordered pattern of learning 
behaviors: 

\[ H(x) = -\sum_{i=1}^{N} P(x_i) \log_e P(x_i) \tag{2} \] 

The details on how the features were computed are listed below in 
Table 1. 


Table 1. Descriptions of features used to predict success 

Feature | Description 
WE | The transition probability from making an error to requesting an explanation 
EE | The transition probability from requesting an explanation to requesting another explanation 
WW | The transition probability from making an error to making an error again 
WEW | The transition probability from making an error and requesting an explanation to making an error 
WEEW | The transition probability from making an error and requesting two sequential explanations to making an error again 
PF | The proportion of times a student made five [...] 
WC | The transition probability from making an error to giving a correct answer 
WEC | The transition probability from making an error and requesting an explanation to giving a correct answer 
WEEC | The transition probability from making an error and requesting two sequential explanations to giving a correct answer 
PNew | The probability of starting a new topic after making an error or requesting an explanation on the current topic 
FNew | The probability of starting a new topic after failing a topic 
PReview | The percentage of mastered topics that the student reviews after mastering them 
Entropy | The entropy value produced based on students’ learning behaviors 
PS | The proportion of the mastered topics out of the number of the attempted topics during learning 
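The two formulas in Section 5.2 can be illustrated with a short sketch. It is not the authors' feature pipeline, and it rests on assumptions that the text does not specify: actions are encoded as one character each (W, C, E), the base rate Pr(M_{t+1}) is taken as the share of that action among all logged actions, and the entropy is computed over the distribution of consecutive action pairs. The function names are hypothetical.

```python
import math
from collections import Counter


def likelihood(actions: str, prior: str, nxt: str) -> float:
    """Equation 1: L(prior -> nxt) for one student's action string.

    `prior` may span several actions (e.g. "WE"); `nxt` is a single action.
    Returns NaN when the metric is undefined (prior never followed by an
    action, or base rate equal to 1).
    """
    n = len(prior)
    # Actions that immediately follow each occurrence of `prior`.
    followers = [actions[i + n] for i in range(len(actions) - n)
                 if actions[i:i + n] == prior]
    base_rate = actions.count(nxt) / len(actions) if actions else math.nan
    if not followers or not base_rate < 1.0:
        return math.nan
    conditional = followers.count(nxt) / len(followers)
    return (conditional - base_rate) / (1.0 - base_rate)


def transition_entropy(actions: str, alphabet: str = "WCE") -> float:
    """Equation 2 over consecutive action pairs, using the natural log."""
    pairs = [actions[i:i + 2] for i in range(len(actions) - 1)
             if actions[i] in alphabet and actions[i + 1] in alphabet]
    counts = Counter(pairs)
    total = sum(counts.values())
    if total == 0:
        return math.nan
    return -sum((c / total) * math.log(c / total) for c in counts.values())


log = "WEWCWECC"  # toy action log for one student
print(likelihood(log, "WE", "W"))   # WEW feature
print(transition_entropy(log))      # entropy feature
```

A positive value from `likelihood` means the follow-up action occurs after the prior sequence more often than its base rate, matching the interpretation given in the text.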




6. RESULTS 


6.1 Description of features 

Before building the prediction model, we calculated basic 
descriptive statistics. The mean and standard deviations are listed 
in Table 2. 


Table 2. Feature means and standard deviations 


6.2 Model development 


Stepwise logistic regression with backward elimination was used 
to generate the predictive model of students’ success. The final 
model included requesting an explanation after making an error 
(WE), requesting two sequential explanations (EE), making an 
error again after making an error and requesting an explanation 
(WEW), changes in learning behaviors over time (entropy) and 
reviewing previously mastered topics (PReview). Each of these metrics was 
a statistically significant predictor of students’ success (i.e. the 
score in the last assessment is greater or less than 60%) in 
ALEKS. The details on the prediction model are displayed in 
Table 3. 


Table 3. The results of multi-feature logistic regression on 
students’ success 


S.E. Z value 
PReview 9.44 

Note. p<.000, p<.05 


The multicollinearity check indicated that correlations between 
features were low. The VIF value (i.e. variance inflation factor) 
for each feature is listed in Table 4. 


Furthermore, logistic regressions that each include only a single 
feature were conducted to examine suppression effects. The results 
are listed in Table 5. Compared with the multi-feature logistic 
regression, the direction of the relationship between each feature 
and success did not change in the single-feature logistic 
regressions. Therefore, the relationships between features and 
success were not affected by suppression effects. 


Based on the results of the logistic regressions, students were 
less likely to be successful in the last assessment if they tended 
to read two consecutive explanations, made an error after making 
an error and requesting an explanation, or demonstrated 
irregularity in their learning behaviors. By contrast, the more 
frequently students reviewed topics they had already mastered, 
the more likely they were to pass the last assessment in ALEKS. 


Table 4. Multicollinearity between features in the prediction 
model 


Table 5. The summary of single-feature logistic regressions on 
students’ success 


6.3 Model goodness 


The fit index (AIC) of the prediction model on the training data 
was 115.67. McFadden’s pseudo R² on the training data was .30, 
indicating that this model accounts for a substantial amount of the 
variance in student success. 
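For readers unfamiliar with these two fit indices, the sketch below shows how they follow from a model's log-likelihood. It uses the standard definitions (AIC = 2p - 2 ln L; McFadden's pseudo R² = 1 - ln L_model / ln L_null, where the null model predicts the base rate for everyone). The function names and the toy probabilities are illustrative assumptions and do not reproduce the values reported above.

```python
import math
from typing import Sequence


def log_likelihood(y: Sequence[int], p: Sequence[float]) -> float:
    """Bernoulli log-likelihood of labels y given predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))


def null_log_likelihood(y: Sequence[int]) -> float:
    """Log-likelihood of an intercept-only model (predicts the base rate)."""
    base = sum(y) / len(y)
    return log_likelihood(y, [base] * len(y))


def aic(ll: float, n_params: int) -> float:
    """Akaike information criterion; lower values indicate a better trade-off."""
    return 2 * n_params - 2 * ll


def mcfadden_r2(ll_model: float, ll_null: float) -> float:
    """McFadden's pseudo R-squared."""
    return 1.0 - ll_model / ll_null


# Toy example: four students, predicted probabilities of success.
y = [1, 1, 0, 0]
p = [0.8, 0.7, 0.3, 0.2]
ll = log_likelihood(y, p)
print(aic(ll, n_params=2), mcfadden_r2(ll, null_log_likelihood(y)))
```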


The model’s accuracy of prediction on test data was 0.71. The 
AUC of test data (area under the ROC curve) was 0.77. The plot 
of the ROC curve is illustrated in Figure 3. 
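As a reminder of what these two test-set figures measure, the following sketch computes both from predicted probabilities. The 0.5 classification cut-off is an assumption (the text does not state the cut-off used), the AUC is computed through its rank interpretation (the probability that a randomly chosen successful student receives a higher score than a randomly chosen unsuccessful one), and the toy data are invented.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_prob: Sequence[float],
             cutoff: float = 0.5) -> float:
    """Share of students whose thresholded prediction matches the label."""
    preds = [1 if p >= cutoff else 0 for p in y_prob]
    return sum(int(a == b) for a, b in zip(preds, y_true)) / len(y_true)


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0, 1]
y_prob = [0.9, 0.4, 0.35, 0.6, 0.7]
print(accuracy(y_true, y_prob), auc(y_true, y_prob))
```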


[Figure: ROC curve of the prediction model; x-axis: False positive rate] 


Figure 3. The ROC plot of the prediction model 


7. DISCUSSIONS 


The current study developed a logistic model to predict students’ 
overall success in ALEKS and examined the relationship between 
various learning behaviors and success. Our findings contribute to 
the current understanding of the relationship between student 
learning behaviors and their delayed performance in adaptive 
tutoring systems, and provide evidence-based suggestions 
for improving the feedback and interventions in ALEKS. 


Requesting two sequential explanations (EE) had a negative 
relationship with success in the last assessment, a finding in line 
with previous research on the negative effect of overusing help on 
learning [8]. However, the EE behaviors may suggest that 
students did not understand the first explanation rather than 
indicating that the students were “gaming the system”. This can 
be concluded for the following reason. After requesting a worked- 
examples explanation, the student typically receives a new 
problem. Making an error again after making an error and 
requesting an explanation (WEW) was negatively related to 
students’ success. The relationship between WEW and success 
suggests that students frequently make multiple consecutive 
errors, even after receiving the provided worked examples. These 
students may have trouble understanding the example. Therefore, 
if students frequently demonstrate those two behaviors on a 
specific problem, more individually-tailored and deeper-level 
instructions may be needed to provide the necessary help to 
overcome the impasse, such as concept-specific conversations 
with tutor agents that are integrated in ALEKS. 


Another finding consistent with previous research was that 
regular behaviors during learning are positively related to students’ 
performance [cf. 5]. In this study, the measurement of changes of 
behaviors over time (via Shannon entropy) is relatively coarse- 
grained. Moving forward, deeper and finer-grained investigations 
of changes in behavior over time may shed further light on why 
regularity is associated with better outcomes. 


Another finding worth noting was that the percentage of topics 
mastered (PS) during learning was not found to be a significant 
predictor of success on the last assessment. An explanation of this 
finding may lie in the adaptive design of ALEKS. During 
learning, ALEKS continually matches students’ existing 
knowledge with topic difficulty and provides the topics that 
students are most ready to learn, so students focus their time on 
topics that have an appropriate level of difficulty [22]. Thus, the 
percentage of topics being mastered may not differ much between 
students who were successful in the last assessment and those who 
failed the last assessment. Finally, reviewing previously mastered 
topics (PReview) was found to be positively linked to students’ 
success in the last assessment, which confirmed the findings of 
literature [24]. 


Our model predicted student success with reasonable accuracy (0.71 on the test data, with an AUC of 0.77). 
However, some improvements can be made in the future. The 
current model only includes percentages or probabilities of 
behaviors without considering the time spent on these behaviors. 
In the future, adding the time duration of behaviors may increase 
the prediction accuracy of the model. Additionally, refining the 
measurements of behaviors may increase the prediction accuracy 
of the model. For example, changes in learning behaviors over 
time could be measured during different learning phases or in 
specific temporal sequences. 


By better understanding the factors associated with success in 
ALEKS, we can design interventions that will improve student 
success — the ultimate goal of any intelligent tutoring system. 


ABSTRACT 


We report an experimental implementation of adaptive learning 
functionality in a self-paced HarvardX MOOC (massive open 
online course). In MOOCs there is a need for evidence-based 
instructional designs that create the optimal conditions for 
learners, who come to the course with widely differing prior 
knowledge, skills and motivations. But users in such a course are 
free to explore the course materials in any order they deem fit and 
may drop out any time, and this makes it hard to predict the 
practical challenges of implementing adaptivity, as well as its 
effect, without experimentation. This study explored the 
technological feasibility and implications of adaptive functionality 
to course (re)design in the edX platform. Additionally, it aimed to 
establish the foundation for future study of adaptive functionality 
in MOOCs on learning outcomes, engagement and drop-out rates. 
Our preliminary findings suggest that the adaptivity of the kind 
we used leads to a higher efficiency of learning (without an 
adverse effect on learning outcomes, learners go through the 
course faster and attempt fewer problems, since the problems are 
served to them in a targeted way). Further research is needed to 
confirm these findings and explore additional possible effects. 


Keywords 


MOOCs; assessment; adaptive assessment; adaptive learning. 


1. INTRODUCTION 


Digital learning systems are considered adaptive when they can 
dynamically change the presentation of content to any user based 
on the user’s individual record of interactions, as opposed to 
simply sending users into different versions of the course based on 
preexisting information such as user’s demographic information, 
education level, or a test score. Conceptually, an adaptive learning 
system is a combination of two parts: an algorithm to dynamically 
assess each user’s current profile (the current state of knowledge, 
but potentially also affective factors, such as frustration level), 
and, based on this, a recommendation engine to decide what the 
user should see next. In this way, the system seeks to optimize 
individual user experience, based on each user’s prior actions, but 
also based on the actions of other users (e.g. to identify the course 
items that many others have found most useful in similar 
circumstances). Adaptive technologies build on decades of 
research in intelligent tutoring systems, psychometrics, cognitive 
learning theory and data science [1, 3, 4]. 


Harvard University partnered with TutorGen to explore the 
feasibility of adaptive learning and assessment technology and the 
implications of adaptive functionality for course (re)design in 
HarvardX, and to examine the effects on learning outcomes, 
engagement and course drop-out rates. As the collaboration 
evolved, the following two strategic decisions were made: (1) 
Adaptivity would be limited to assessments in four out of 16 
graded sub-sections of the course. Extra problems would be 
developed to allow adaptive paths; (2) Development efforts would 
be focused on a Harvard-developed Learning Tools Interoperability 
(LTI) tool to support assessment adaptivity on the edX platform. 
Therefore, in the current prototype phase of this project, adaptive 
functionality is limited to altering the sequence of problems, based 
on continuously updated statistical inferences about the knowledge 
components a user has mastered. As a supplement to these assessment 
items, a number of additional learning materials are served 
adaptively as well, based on the rule that a user should see those 
before being served more advanced problems. 


While the prototype enabled us to explore the feasibility of 
adaptive assessment technology and implications of adaptive 
functionality to course (re)design in HarvardX, it is still 
challenging to judge its effects on learning outcomes, engagement 
and course drop-out rates due to the prototype limitations. 
However, we believe that the study will help to establish a solid 
foundation for future research on the effects of adaptive learning 
and assessment on outcomes such as learning gains and 
engagement [5]. 


2. SETUP AND USER EXPERIENCE 


The HarvardX course in this experiment was “Super-Earths and 
Life”. It deals with searching for planets orbiting around stars 
other than the Sun, in particular the planets capable of supporting 
life. The subject matter is physics, astronomy and biology. 
Roughly speaking, the course aims at users with college-level 
knowledge of physics and biology. Some of the assessment 
material in the course requires calculations, and some requires 
extensive factual knowledge (e.g. questions about DNA structure). 
Two versions of the course have already run on the edX platform; 
our adaptivity was implemented as part of the course re-design for 
the third run. 


A number of subsections in the course contained assessment 
modules (homeworks). The experiment consisted of making four 
of these homeworks adaptive for some of the users. At the 
moment of their registration, the course users were randomly split 
50%-50% into an experimental group and into a control group. 
When arriving at a homework, users in the control group see a 
predetermined, non-adaptive set of problems on a page. The same 
is true for the experimental group in all homeworks except the 
four where we deployed the adaptive tool. In these homeworks, a 
user from the experimental group is served problems sequentially, 
one by one, in the order that is individually determined on-the-fly 
based on the user’s prior performance. In addition to problems, 
some instructional text pages were also included in the serving 
sequence. 


To enable adaptivity, we manually compiled a list of knowledge 
components (KCs, for our purposes synonymous with “learning 
objectives”, “learning outcomes”, or “skills”) and tagged 
problems in the course with one or several knowledge 
components. This tagging was done for all assessment items in 
the course (as well as for some learning materials), enabling the 
adaptive engine to gather information from any user’s interaction 
with any problem in the course, not only with those problems that 
are served adaptively. Additionally, the problems in the 4 adaptive 
homeworks were tagged with one of three difficulty levels: 
advanced, regular and easy (other problems in the course were 
tagged by default as regular). No pre-requisite relationships or 
other connections among the knowledge components were used. 


The adaptive engine (a variant of the Bayesian Knowledge Tracing 
algorithm) decides which problem to serve next based on the list 
of KCs covered by the homework and course material. Additional 
rules could be incorporated into the serving strategy. Thus, we had 
a rule that before any problem of difficulty level “Advanced”, the 
user should see a special page with advanced learning material. 
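
For readers unfamiliar with Bayesian Knowledge Tracing, the sketch below shows the textbook single-KC update step. It is not the SCALE implementation, whose exact variant is not described here; the function name and the guess, slip and transit values are assumptions chosen only for illustration.

```python
def bkt_update(p_mastery, correct, guess=0.2, slip=0.1, transit=0.15):
    """One step of standard Bayesian Knowledge Tracing for a single KC.

    p_mastery: prior probability that the learner has mastered the KC.
    correct:   whether the submitted answer was correct.
    guess, slip, transit: hypothetical parameter values (assumptions).
    """
    if correct:
        # A correct answer comes from mastery without a slip, or from a lucky guess.
        numerator = p_mastery * (1 - slip)
        denominator = numerator + (1 - p_mastery) * guess
    else:
        # An incorrect answer comes from a slip, or from non-mastery without a guess.
        numerator = p_mastery * slip
        denominator = numerator + (1 - p_mastery) * (1 - guess)
    posterior = numerator / denominator
    # Allow for learning between this opportunity and the next one.
    return posterior + (1 - posterior) * transit


p = 0.5
for outcome in (True, False, True):
    p = bkt_update(p, outcome)
    print(round(p, 3))
```

For a problem tagged with several KCs, one simple option (again an assumption) is to apply the same update to each tagged KC independently.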


The parity between experimental and control groups was set up as 
follows. In the pool from which problems are adaptively served to 
the experimental group, all the regular-difficulty problems were 
the ones that the control group saw in these homeworks. The 
control group had access to the easy and advanced problems as 
well: students in this group saw a special “extra materials” page 
after each of the 4 experimental homeworks. This page contained 
the links to all the advanced instructional materials and advanced 
and easy problems for this homework, for no extra credit. Thus, 
all the materials that an experimental user could see were also 
available to the control students. There were two main reasons for 
this: obvious usefulness for comparative studies, and enabling all 
students, experimental and control, to discuss all problems in the 
course forum. 


When an experimental group user is going through an adaptive 
homework, the LTI tool loads edX problem pages in an iFrame. 


Submitting (“checking”) an answer to the problem triggers an 
update of user’s mastery, but does not trigger serving the next 
problem. For that to happen, the user has to click the button “Next 
Question” outside the iFrame. The user always can revisit any of 
the previously served problems. 


In edX, users usually get several attempts at a problem. Thus, it 
may be possible for a user to submit a problem after the next 
problem has already been served. Fig. 1, for instance, shows a 
situation, where so far 4 problems have been served (note the 
numbered tabs in the upper left), but the user is currently viewing 
problem 2 in this sequence, not the latest one. The user is free to 
re-submit this problem, which will update the user’s mastery 
(although in this case there is no need to do so, since it appears 
that problem 2 has been answered correctly). It will not alter the 
existing sequence (problems 3 and 4 will not be replaced by 
others), but it may have an effect on what will be served as 
problem 5 and so on. 


The user interface keeps track of the total number of points earned 
in a homework (upper right corner in Fig. 1). The user knows how 
many points in total are required and may choose to stop once this 
is achieved (earning more points will no longer affect the grade). 
Otherwise, the serving sequence ends when the pool of questions 
is exhausted. Potentially, it could also end when the user’s 
probability of mastery on all relevant KCs passes a certain 
mastery threshold (a high probability, at which we consider the 
mastery to be, in practical terms, certain; it was set to 0.9). 
However, in this particular implementation, due to having only a 
modest number of problems, this was not done. 
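
The end-of-sequence conditions described above can be sketched as follows. The 0.9 threshold comes from the text; the function and argument names are hypothetical, and the mastery rule is off by default because it was not used in this implementation.

```python
MASTERY_THRESHOLD = 0.9  # value stated in the text


def sequence_ended(remaining_pool, mastery_by_kc, use_mastery_rule=False):
    """Return True when no further problem should be served.

    remaining_pool:   problems not yet served to this learner.
    mastery_by_kc:    mapping from KC id to current probability of mastery.
    use_mastery_rule: optional stop once every relevant KC reaches the threshold.
    """
    if not remaining_pool:
        return True  # the pool of questions is exhausted
    if use_mastery_rule and mastery_by_kc:
        return all(p >= MASTERY_THRESHOLD for p in mastery_by_kc.values())
    return False


print(sequence_ended(["p7"], {"kc_a": 0.95, "kc_b": 0.97}))
print(sequence_ended(["p7"], {"kc_a": 0.95, "kc_b": 0.97}, use_mastery_rule=True))
```

Stopping because enough points have been earned is the learner's own choice, so it is deliberately not modeled here.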


In order to explore possible effects of adaptive experiences on 
learners’ mastery of content knowledge, competence-based pre- 
and post-assessments were added to the course and administered to 
study participants in both experimental and control groups. 
Typical HarvardX time-stamped clickstream data and pre- and 
post-course survey data were also collected. 


2.1 Course Design Considerations 

Adaptive learning techniques require the development of 
additional course materials, so that different students can be 
provided with different content. For our prototype, tripling the 
existing content in the four adaptive subsections was considered a 
minimum to provide a genuine adaptive experience. This was 
achieved by work from the project lead and by hiring an outside 
content expert. This did not provide each knowledge component 
with a large number of problems, reducing the significance of 
knowledge tracing, but it was sufficient for the purpose of our 
experiment. The total time outlay was ~200 hours. Keeping the 
problems housed within the edX platform avoided substantial 
amounts of software development. 


The tagging of content with knowledge components was done by 
means of a shared Google spreadsheet, which contained a list of 
content items in one sheet (both assessment and learning 
materials), a list of knowledge components in another, and a 
correspondence table (the tagging itself), including the difficulty 
levels, in the third. 
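
The three-sheet layout can be pictured as three small tables. The sketch below uses made-up item and KC identifiers (none are from the course) and shows how the correspondence table yields the set of KCs per item and a count of problems by number of KCs.

```python
from collections import Counter, defaultdict

# Sheet 1: content items, both assessment and learning materials (made-up ids).
content_items = {"p1": "problem", "p2": "problem", "p3": "problem", "t1": "text"}

# Sheet 2: knowledge components (made-up ids and descriptions).
knowledge_components = {"kc_a": "hypothetical KC A", "kc_b": "hypothetical KC B"}

# Sheet 3: correspondence table (item, KC, difficulty level), one row per tag.
tagging = [
    ("p1", "kc_a", "regular"),
    ("p2", "kc_a", "easy"),
    ("p2", "kc_b", "easy"),
    ("p3", "kc_b", "advanced"),
    ("t1", "kc_b", "advanced"),
]


def kcs_per_item(rows):
    """Collect the set of KC ids tagged on each content item."""
    result = defaultdict(set)
    for item_id, kc_id, _difficulty in rows:
        result[item_id].add(kc_id)
    return dict(result)


def kc_count_histogram(rows, items):
    """Count problems by the number of KCs they are tagged with."""
    per_item = kcs_per_item(rows)
    return Counter(
        len(kcs) for item_id, kcs in per_item.items() if items[item_id] == "problem"
    )


print(kcs_per_item(tagging))
print(kc_count_histogram(tagging, content_items))
```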


Most of the time was spent on creating new problems based on the 
existing ones. For these the tagging process was “reversed”: rather 
than tag existing content with knowledge components, the experts 
created content targeting knowledge components and difficulty 
levels. Commonly, an existing problem was considered to be of 
“regular” difficulty, and the expert’s task was to create an “easy” 
and/or an “advanced” version of it. 


103 distinct knowledge components were used in tagging. The 
experts used their judgement in defining them. 66 of these were 
used in tagging problems, and in particular the 39 adaptively 
served problems were tagged with 25 KCs. The granularity of 
KCs was such that a typical assessment problem was tagged with 
one learning objective (which is desirable for knowledge tracing). 
Namely, among the adaptively served problems, 31 were tagged 
with a single KC, 7 problems with 2 KCs, and 1 problem with 
more than 2 KCs. 


2.2 LTI Tool Development 


To enable the use of an adaptive engine in an edX course, Harvard 
developed the Bridge for Adaptivity (BFA) tool (open-source, 
GitHub link available upon request). BFA is a web application 
that uses the LTI specification to integrate with learning 
management systems such as edX. BFA acts as the interface 
between the edX course platform and the TutorGen SCALE 
(Student Centered Adaptive Learning Engine) system, and 
handles the display of problems recommended by the adaptive 
engine. Problems are accessed by edX XBlock URLs. 


This LTI functionality allows BFA to be embedded in one or 
more locations in the course (4 locations in our case). The user 
interface seen by a learner when they encounter an installed tool 
instance is that shown in Fig. 1. 


Figure 1. Adaptive assessment user interface 


Problems from the edX course are displayed one at a time in a 
center activity window, with a surrounding toolbar that provides 
features such as navigation, a score display, and a shareable link 
for the current problem (that the learner can use to post to a forum 
for help). The diagram in Fig. 2 describes the data passing in the 
system. The user-ids used by edX are considered sensitive 
information and are not shared with SCALE: we created a 
different user-id system for SCALE, and the mapping back and 
forth between the two id-systems happens in the back end of the 
app. 


[Figure 2 diagram. Answer path: the student checks an answer in the 
edX XBlock; (student, activity, score) is sent to the app back end, 
where the LTI app receives the answer data and stores it in the 
database, and is then passed to the TutorGen SCALE Transaction API. 
Next-question path: the student clicks the “Next Question” button in 
the app front end; the app back end asks the adaptive engine 
(TutorGen SCALE Activity API) for an activity id, sending (student) 
and receiving (activity); it then gets the XBlock URL for the 
activity id and loads the new edX XBlock.] 


Figure 2. Diagram of data passing in the system 


Every problem-checking event by the user (both inside and 
outside the adaptive homeworks) sends the data to SCALE, to 
update the mastery information in real time. Every “Next Question” 
event in an adaptive homework sends to SCALE a request for the 
next content item to be served to the user (this could be 
instructional material or a problem). SCALE sends back the 
recommendation, which is accessed as an edX XBlock and 
loaded. 
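
The two event paths and the separate user-id system can be sketched as below. This is an illustrative model only: the class and method names, the stub engine, and the example URLs are all hypothetical and do not reflect the actual BFA code or the SCALE API.

```python
import uuid


class StubEngine:
    """Stand-in for the adaptive engine (hypothetical; not the SCALE API)."""

    def __init__(self, activities):
        self.activities = list(activities)
        self.transactions = []

    def record(self, student, activity, score):
        self.transactions.append((student, activity, score))

    def next_activity(self, student):
        # Trivial policy for the sketch: first activity with no recorded submit.
        seen = {a for s, a, _ in self.transactions if s == student}
        for activity in self.activities:
            if activity not in seen:
                return activity
        return None


class Bridge:
    """Sketch of the app back end sitting between the platform and the engine."""

    def __init__(self, engine, xblock_urls):
        self.engine = engine
        self.xblock_urls = xblock_urls  # activity id -> XBlock URL
        self._pseudonyms = {}           # platform user-id -> engine user-id

    def _pseudonym(self, edx_user_id):
        # The platform user-id never leaves the back end; the engine sees an opaque id.
        if edx_user_id not in self._pseudonyms:
            self._pseudonyms[edx_user_id] = uuid.uuid4().hex
        return self._pseudonyms[edx_user_id]

    def on_check(self, edx_user_id, activity, score):
        """Problem-checking event: forward (student, activity, score)."""
        self.engine.record(self._pseudonym(edx_user_id), activity, score)

    def on_next_question(self, edx_user_id):
        """Next-question event: ask for an activity id and resolve its URL."""
        activity = self.engine.next_activity(self._pseudonym(edx_user_id))
        return None if activity is None else self.xblock_urls[activity]


urls = {"a1": "https://example.org/xblock/a1", "a2": "https://example.org/xblock/a2"}
bridge = Bridge(StubEngine(["a1", "a2"]), urls)
print(bridge.on_next_question("edx-user-1"))
bridge.on_check("edx-user-1", "a1", 1.0)
print(bridge.on_next_question("edx-user-1"))
```

Keeping the id mapping inside the back end means the engine can be swapped or audited without exposing platform identifiers.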


The edX support for LTI is highly stable. The challenge is that 
edX exports data on a weekly cycle, but we needed to receive the 
information about submits in real time. We achieved this by 
creating a reporting JavaScript snippet and inserting it into every 
problem. 


2.3 TutorGen Adaptive Engine 

TutorGen SCALE is focused on improving learning outcomes 
using data collected from existing and emerging educational 
technology systems combined with the core technology to 
automatically generate adaptive capabilities. Key features that 
SCALE provides include knowledge tracing, skill modeling, 
student modeling, adaptive problem selection, and automated hint 
generation for multi-step problems. The SCALE engine improves over 
time with additional data and/or with the help of human input by 
providing machine learning using a human-centered approach. 
The algorithms have been tested on various data sets in a wide 
range of domains. For successful implementation and optimized 
adaptive operations, it is important that the knowledge 
components be tagged at the right level of granularity. 


SCALE has been used in the intelligent tutoring system 
environment, providing adaptive capabilities during the formative 
learning stages. In this HarvardX course, SCALE is used 
more in the assessment stage of the student experience. In 
order to accomplish the goals of the prototype for this pilot study, 
we extended our algorithms to consider not only the knowledge 
components (KCs), but also problem difficulty. This 
accommodates the needs of this course by providing an adaptive 
experience for students while still supporting the logical flow of 
the course. Further, the flexible nature of the course, having all 
content available and open to students for the duration of the 
course, presents some additional requirements to ensure that 
students are presented with problems based on their current state 
and not necessarily where the system believes they should 
navigate. 


A variety of serving strategies are available in SCALE and can be 
swapped in and out. In this particular implementation, while the 
algorithm did trace the students’ knowledge, the results were used 
minimally in the serving strategy: it did not make sense to do 
otherwise given the small size of the adaptive problem pool. 
SCALE was configured to consider, after each submit, the 
probability that the learner has mastered the KCs from the problem 
most recently worked, the difficulty of that problem, and the 
correctness of the submitted answer. A general and simplified 
explanation of the process is as follows. Each of the four adaptive 
modules was treated as a separate instance, with its own pool of 
problems. Each problem can be served to each learner no more 
than once. Given the last problem submitted by a learner in the 
module, the candidate to be served next is the (previously unseen) 
problem, whose KC tagging overlaps with the KCs of the last 
submitted problem and includes at least one KC, on which the 
user has not yet reached the mastery threshold. If multiple 
candidates are available, SCALE will serve the one with a KC 
closest to mastery. If no candidates are available, other problems 
of the same difficulty within the same module will be served (i.e. 
SCALE switches to other KCs). The difficulty level of the next 
served problem is determined by the correctness of the last submit. As 
long as problems of the same difficulty level as the last one are 
available, the learner will remain at that difficulty level. Once 
such problems are exhausted, SCALE will serve a more or less 
difficult problem, depending on whether the last submit in the 
module was correct or incorrect. 
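
One possible reading of the simplified serving rule above is sketched below. It is not SCALE code: the data layout, the function name, and the way the difficulty rule and the KC rule are combined (difficulty level first, then KC-based candidates within that level) are assumptions made for illustration.

```python
LEVELS = ["easy", "regular", "advanced"]
MASTERY_THRESHOLD = 0.9


def next_problem(pool, seen, last_id, last_correct, mastery):
    """Pick the next problem id for one module, or None if the pool is exhausted.

    pool:    problem id -> {"kcs": set of KC ids, "difficulty": level name}
    seen:    ids already served to this learner (each is served at most once)
    mastery: KC id -> current probability of mastery
    """
    unseen = {pid: p for pid, p in pool.items() if pid not in seen}
    if not unseen:
        return None
    last = pool[last_id]

    # Stay at the last difficulty level while unseen problems remain there.
    at_level = [pid for pid, p in unseen.items() if p["difficulty"] == last["difficulty"]]
    if not at_level:
        # Otherwise move up after a correct submit, down after an incorrect one,
        # falling back to whatever level still has problems.
        idx = LEVELS.index(last["difficulty"])
        step = 1 if last_correct else -1
        for target in (idx + step, idx + 2 * step, idx - step, idx - 2 * step):
            if 0 <= target < len(LEVELS):
                at_level = [
                    pid for pid, p in unseen.items() if p["difficulty"] == LEVELS[target]
                ]
                if at_level:
                    break

    def unmastered(pid):
        values = [mastery.get(kc, 0.0) for kc in unseen[pid]["kcs"]]
        return [v for v in values if v < MASTERY_THRESHOLD]

    # Candidates share a KC with the last problem and still have an unmastered KC.
    candidates = [
        pid for pid in at_level if unseen[pid]["kcs"] & last["kcs"] and unmastered(pid)
    ]
    if candidates:
        # Prefer the candidate whose unmastered KC is closest to mastery.
        return max(candidates, key=lambda pid: max(unmastered(pid)))
    return at_level[0]  # no candidate: switch to other KCs at this level


pool = {
    "r1": {"kcs": {"a"}, "difficulty": "regular"},
    "r2": {"kcs": {"a", "b"}, "difficulty": "regular"},
    "r3": {"kcs": {"c"}, "difficulty": "regular"},
    "e1": {"kcs": {"a"}, "difficulty": "easy"},
    "adv1": {"kcs": {"a"}, "difficulty": "advanced"},
}
mastery = {"a": 0.95, "b": 0.6, "c": 0.2}
print(next_problem(pool, {"r1"}, "r1", True, mastery))
```

With a pool this small, most decisions are driven by the difficulty rule, which matches the remark above that the traced mastery was used only minimally.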


2.4 Quantitative Details and Findings 

The course was launched on Oct 19, 2016. The data for the 
analysis presented in this paper were accessed on Mar 08, 2017 
(plus or minus a few days, since different parts of the data were 
extracted at different times), after the official end date of the 
course. 


Table 1. Number of students attempting assessment items of 
different difficulty levels 


                                          Experimental   Control 
                                          group          group 
Regular level only                        58             73 
Easy level only                           0              0 
Advanced level only                       1              0 
(Regular ∪ Easy) levels only              1              35 
(Regular ∪ Advanced) levels only          105            0 
(Easy ∪ Advanced) levels only             0              1 
(Regular ∪ Easy ∪ Advanced) levels        99             145 
Total students attempting new problems    264            254 


We will refer to the list of problems from which problems were 
served adaptively to the experimental group as “new problems”. 
The control group may have interacted with these as well, 
although not adaptively (as additional problems that do not count 
towards the grade). There were 39 new problems, out of which 13 
were regular difficulty (these formed the assessments for the 
control group of students), 14 were advanced and 12 were easy. 
For the control group, the advanced and easy problems were 
offered as extra material after assessment, with no credit toward 
the course grade. The numbers of students attempting assessment 
problems of different difficulty levels are given in Table 1. 


To get a sense of how the two groups of students performed in the 
course, we compared the group averages of the differences in 
scores in the pre-test and post-test. For reasons unrelated to this 
study, both tests were randomized: in each test each user received 
9 questions, randomly selected from a bank of 17. All questions 
were graded on the 0-1 scale. The users knew that the pre- and 
post-tests do not contribute to the grade, and so only about 40% 
of users took both. Moreover, not all of these questions were 
relevant for (i.e. tagged with) those 25 knowledge components, 
with which the adaptively served problems were tagged. So the 
number of offered relevant questions varied randomly from user 
to user. For these reasons the pre- and post-test are not the most 
reliable measure of knowledge gain, but it was still important for 
us to make sure that adaptivity did not have any adverse effect. 
In Fig. 3 we 
subset the student population to those individuals who attempted a 
“new problem” and a relevant pre-test question and a relevant 
post-test question, and used the average score from relevant 
questions as the student’s relevant score. For instance, if one user 
attempted two relevant questions in a pre-test, and another user 
attempted three, and the questions were answered correctly, both 
users have the relevant score 1: (1+1)/2=(1+1+1)/3. 


[Figure 3 chart: difference between post-test and pre-test scores 
(group averages), 104 experimental users and 114 control users. 
Horizontal axis: pre-test and post-test; vertical axis: average 
question score. Reported values, in order of appearance: p = 0.22, 
ES = 0.17; p = 0.077, ES = 0.24.] 

Figure 3. Comparison of relevant post-test and pre-test scores. 
Here and everywhere below, the p-values are two-tailed from 
the Welch two-sample t-test, and the effect size is Cohen’s 
d (Cohen suggested considering d=0.2 as a “small”, d=0.5 as a 
“medium” and d=0.8 as a “large” effect size). 


There is no significant between-group difference, either in the 
pre-test scores (p-value 0.49, effect size 0.093) or in the post-test 
scores (p-value 0.21, effect size 0.17). The two populations of pre- 
test takers remain comparable after subsetting to those who 
attempted new problems and the post-test, and we see no 
statistically significant difference in knowledge gain 
between the experimental and control groups. 
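
For reference, the two statistics used throughout this section can be computed as in the sketch below. The sample values are made up and are not the study data. The pooled-standard-deviation form of Cohen's d is an assumption, since the exact variant used is not stated. A two-tailed p-value would then be read from the Student t distribution with the returned degrees of freedom, for example via `scipy.stats.ttest_ind(x, y, equal_var=False)`.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)  # squared standard error of the mean of x
    vy = variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def cohens_d(x, y):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)


# Made-up example scores for two groups.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof, cohens_d(group_a, group_b))
```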


We did not see a difference in the final grade of the course: the 
mean grade was 83.7% in the experimental group vs. 82.9% in the 
control group, which is not a significant difference (p-value 0.76, 
effect size 0.06). Likewise, there is no significant between-group 
difference in the completion and certification rates (about 20%), 
or in demographics of students who did not drop out. 


Students in the experimental group tended to make more attempts 
at a problem (Fig. 4), and they tried fewer problems (Fig. 5), most 
strikingly among the easy new problems: for these we have 1,162 
recorded scores in the control group and only 423 in the 
experimental group. 


[Figure 4 chart: “Persistence: number of attempts per problem per 
user”. Vertical axis: average number of attempts; bars for the 
control and experimental groups. Reported values: adpt. Easy, 
p = 7.9e-05, ES = 0.25; adpt. Reg., p = 0.0013, ES = 0.091; 
Non-adpt., p = 5.7e-06, ES = 0.093.] 


Figure 4. Comparison of attempt numbers between the 
experimental and control groups in the chapters where 
adaptivity was implemented. The attempt numbers are 
averaged both over the problems and over the users. Non- 
adaptive problems are problems not from the 4 experimental 
homeworks but from the same two chapters of the course as 
the experimental homeworks. 


[Figure 5 chart: “Number of attempted problems”. Bars for the 
control and experimental groups; surviving bar labels: 16.63, 15.76. 
Reported values, in order of appearance: adpt. Easy, p = 1.5e-07, 
ES = -0.64; p = 0.00039, ES = -0.31; Non-adpt., p = 0.26, 
ES = -0.092.] 


Figure 5. Comparison of the numbers of attempted problems between the 
experimental and control groups in the chapters where 
adaptivity was implemented. Non-adaptive problems are 
problems not from the 4 experimental homeworks but from 
the same two chapters of the course as the experimental 
homeworks. 


The interpretation emerges that the students who experienced 
adaptivity showed more persistence by making more attempts per 
problem (presumably, because adaptively served problems are 
more likely to be on the appropriate current mastery level for a 
student), while taking a faster track through the course materials. 
We also observed that the experimental group students tended to 
have a lower net time on task in the course: an average of 5.47 
hours vs. 5.85 in the control group (although in this comparison 
the p-value is high, 0.21, and the effect size is —0.11). 


Thus, we conjecture that the adaptivity of this kind leads to a 
higher efficiency of learning. Students go through the course 
faster and attempt fewer problems, since the problems are served 
to them in a targeted way. And yet there is no evidence of an 
adverse effect on the students’ overall performance or knowledge 
gain. Given the limited implementation of adaptivity in this 
course, it is not surprising that we cannot find a statistically 
significant effect on student overall performance in the course. 
We expect to refine these conclusions in future courses with a 
greater scope of adaptivity. 


3. FUTURE WORK 


Our implementation of adaptivity provided some insights for 
future work. For instance, assessment questions in MOOCs can 
vary greatly in nature, difficulty and format (multiple choice, 
check-all-that-applies, numeric response, etc.), and may often be 
tagged with more than one knowledge component. To be suitable 
for a MOOC, an adaptive engine should be able to handle these 
features. 


There appear to be extensive opportunities to expand adaptive 
learning and assessment in MOOCs. The low total number of 
problems was the most severe restriction on the variability of 
learner experience in this study. In future applications, larger 
sets of tagged items could provide a more adaptive learning 
experience for students, while also providing a higher degree of 
certainty of assessment results. Interestingly, in some MOOCs 
(for example, those teaching programming languages) it may be 
possible to create very large numbers of questions algorithmically, 
essentially by filling question templates with different data. 
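
The template idea can be sketched in a few lines. The template text, parameter ranges and function name below are invented for illustration; a real item bank would also need tagging with knowledge components and difficulty levels.

```python
import random


def generate_questions(template, value_ranges, answer_fn, n, seed=0):
    """Fill a question template with random integers and compute each answer.

    template:     format string with named fields, e.g. "{a} % {b}"
    value_ranges: field name -> (low, high) inclusive integer range
    answer_fn:    function of the field values returning the correct answer
    """
    rng = random.Random(seed)  # seeded so the generated set is reproducible
    questions = []
    for _ in range(n):
        values = {name: rng.randint(lo, hi) for name, (lo, hi) in value_ranges.items()}
        questions.append(
            {"text": template.format(**values), "answer": answer_fn(**values)}
        )
    return questions


questions = generate_questions(
    "What does the Python expression {a} % {b} evaluate to?",
    {"a": (10, 99), "b": (2, 9)},
    lambda a, b: a % b,
    n=5,
)
for q in questions:
    print(q["text"], "->", q["answer"])
```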


In this study, adaptivity was implemented mostly on assessment 
problems. Given the structure of many MOOCs, more integration 
between learning content and assessment could provide an 
adaptive experience that would guide students to content that 
could improve their understanding based on how they perform on 
integrated assessments. 


Affective factors could be included to provide a more 
personalized learning experience. We can conceive an adaptive 
engine which decides what item to serve next based not just on the 
mastery but also on the behavioral patterns interpreted as boredom 
or frustration. 


Finally, this work could lead to improved MOOC platform 
features that would contribute to improved student experiences, 
such as optimized group selection [2]. In addition, we anticipate 
expanding this adaptive assessment system to work with other 
LTI-compliant course platforms. Enabling use in a platform such 
as Canvas, the learning management system used university-wide 
at Harvard (and many other schools), would enable adaptivity for 
residential courses on a large scale. An adjustment to the current 
system architecture would be the use of OpenEdX as the platform 
for creating and hosting problems. 


prior to the course end) is to appear in the Proceedings of the 
Fourth Annual ACM Conference on Learning at Scale (L@S 
2017) as: Rosen, Y., Rushkin, I., Ang, A., Fredericks, C., 
Tingley, D., Blink, M.J. 2017. Designing Adaptive 
Assessments in MOOCs. 


ABSTRACT 

With the growing popularity of MOOCs and computer-aided 
learning systems, as well as the growth of social networks 
in education, we have begun to collect increasingly large 
amounts of educational graph data. This graph data includes 
complex user-system interaction logs, student-produced 
graphical representations, and the conceptual hierarchies 
embedded in large amounts of graph data. There is abundant 
pedagogical information beneath these graph datasets. As a 
result, graph data mining techniques such as graph grammar 
induction, path analysis, and prerequisite relationship 
prediction have become increasingly important. Also, graphical 
model techniques (e.g. Hidden Markov Models or probabilistic 
graphical models) have become more and more important 
for analyzing educational data. 

While educational graph data and data analysis based on 
graphical models have grown increasingly common, it is 
necessary to build a strong community for educational graph 
researchers. This workshop will provide such a forum for 
interested researchers to discuss ongoing work, share common 
graph mining problems, and identify technical challenges. 
Researchers are encouraged to discuss prior analyses 
of graph data and educational data analyses based on graphical 
models. We also welcome discussions of in-progress work 
from researchers seeking to identify suitable sources of data 
or appropriate analytical tools. 


1. PRIOR WORKSHOPS 

So far, we have successfully held two international workshops 
on Graph-based Educational Data-Mining. The first 
one was held in London, co-located with EDM 2014. It 
featured 12 publications, of which 6 were full papers and the 
remainder short papers. Having roughly 25 full-day attendees 
and additional drop-ins, it led to a number of individual 
connections between researchers and the formation of an e-mail 
list for group discussion. The second one was co-located with 
EDM 2015 in Spain. There, 10 authors presented their published 
work, including 4 full papers and 6 short papers. 


2. OVERVIEW AND RELEVANCE 

Graph-based data mining and educational data analysis based 
on graphical models have become emerging disciplines in 
EDM. Large-scale graph data, such as social network data, 
complex user-system interaction logs, student-produced 
graphical representations, and conceptual hierarchies, carries 
multiple levels of pedagogical information. Exploring such data 
can help to answer a range of critical questions such as: 


• For social network data from MOOCs, online forums, 
and user-system interaction logs: 

– What social networks can foster or hinder learning? 

– Do users of online learning tools behave as we 
expect them to? 

– How does the interaction graph evolve over time? 

– What data can we use to define relationship graphs? 

– What path(s) do high-performing students take 
through online materials? 

– What is the impact of teacher-interaction on 
students’ observed behavior? 

– Can we identify students who are particularly 
helpful in a course? 


• For computer-aided learning (writing, programming, 
etc.): 

– What substructures are commonly found in 
student-produced diagrams? 

– Can we use prior student data to identify 
students’ solution plan, if any? 

– Can we automatically induce empirically-valid graph 
rules from prior student data and use induced 
graph rules to support automated grading systems? 

Graphical model techniques, such as Bayesian Networks, Markov 
Random Fields, and Conditional Random Fields, have been 
widely used in EDM for student modeling, decision making, 
and knowledge tracing. Utilizing these approaches can help 
to: 


e Learn students’ behavioral patterns. 


e Predict students’ behaviors and learning outcomes. 
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e Induce pedagogical strategies for computer-aided learn- 
ing systems. 


e Identify the difficult level of the knowledge components 
in the intelligent tutoring systems. 
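

Knowledge tracing with a graphical model can be illustrated by the standard Bayesian Knowledge Tracing update, which treats mastery of a skill as a hidden binary state. The sketch below is illustrative only: the function name `bkt_update` and the guess, slip, and learning-rate values are assumptions, not parameters reported by the organizers.

```python
def bkt_update(p_known, correct, guess=0.2, slip=0.1, learn=0.15):
    """Return P(skill known) after observing one student response.

    All parameter defaults are hypothetical values chosen for illustration.
    """
    if correct:
        # A correct answer comes from knowing and not slipping, or from guessing.
        evidence = p_known * (1 - slip)
        posterior = evidence / (evidence + (1 - p_known) * guess)
    else:
        # An incorrect answer comes from slipping, or from not knowing and not guessing.
        evidence = p_known * slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - guess))
    # The student may also learn the skill during this practice opportunity.
    return posterior + (1 - posterior) * learn


# Trace the mastery estimate over a short, invented response sequence.
p = 0.3
trace = []
for observed_correct in [True, True, False, True]:
    p = bkt_update(p, observed_correct)
    trace.append(p)
print(trace)
```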


Research related to these questions can help us to better
understand students’ learning status and improve
teaching effectiveness and student learning. Our goal in this
workshop is to bring together researchers with a special interest
in graph-based data analysis to 1) discuss state-of-the-art
tools and technologies, 2) identify common problems and
challenges, and 3) foster a community of researchers for further
collaboration. We will consider the submission of full
and short papers as well as posters and demonstrations covering
a range of graph-related topics that include, but are not
limited to:


• Social network data

• Graphical solution representations

• Graphical behavior models

• Graph-based log analysis

• Large network datasets

• Novel graph-based machine learning methods

• Novel graph analysis techniques

• Relevant analytical tools and standard problems

• Issues with graph models

• Tools and technologies for graph grammar (pattern)
recognition

• Tools and technologies for automatic concept hierarchy
extraction

• Computer-aided learning system development involved
with graphical representations

• Use of graphical models in educational data


We particularly welcome submissions of in-progress work
both from students and researchers who are seeking appropriate
analytical tools for their problems and from developers of graph
analysis tools who are seeking new challenges.


3. WORKSHOP ORGANIZERS 


Dr. Collin F. Lynch is an Assistant Professor in the Department
of Computer Science at North Carolina State University.
His primary research is focused on graph-based educational
data mining, the development of robust intelligent
tutoring systems, and adaptive educational systems for ill- 
defined domains such as scientific writing, law, and engi- 
neering. In his more recent work he has also been involved 
in the development of Intelligent Tutoring Systems for Logic 
and Probability and social networking analysis for research 
communities. 


Dr. Tiffany Barnes is an Associate Professor of Computer
Science at NC State University. She received an NSF CAREER
Award for her novel work in using data to add intelligence
to STEM learning environments. That grant supported
the development of InVis, a novel tool that uses graph-based
representations of student-tutor interaction data to evaluate
the impact of intelligent tutoring systems on student
problem-solvers and to automatically extract hints and student
advice from log data using graph analysis. More recently
she has received grants for the analysis of large-scale
online courses and the development of procedural guidance
from intelligent tutoring system data.


Linting Xue is a third-year Ph.D. student in the Department
of Computer Science at North Carolina State University.
She is interested in graph data mining methods
for educational graph data. Her current research is focused
on automatic graph grammar induction for student-produced
argument diagrams. The induced graph grammars
can be used as features for automatic grading and to provide
hints for argumentative writing.


Niki Gitinabard is a second-year Ph.D. student in the
Department of Computer Science at North Carolina State 
University. She is interested in social network analysis in 
learning environments. She is currently working on social 
graph generation and analysis based on students’ explicit 
and implicit interactions. 


4. WORKSHOP ORGANIZATION 


We will organize this workshop as a full or half-day mini- 
conference with time set aside for paper presentations, large- 
group discussion, and individual networking. We will open 
the workshop with a summary of prior meetings. We will 
spend the morning on presentations with a short discussion 
session before lunch. The afternoon session will be divided 
between presentations and working groups which will focus 
on identifying shared problems, small-group networking, and 
planning for follow-up work. We will invite submissions of
full papers which describe mature work. We will also accept 
short papers describing in-progress work or student projects, 
and poster/demo submissions for those presenting available 
data, tools, and methods. This last category is particularly
targeted at researchers who have data or methods available 
and are seeking to identify potential collaborators. 
1. WORKSHOP TOPIC


This workshop focuses on applications of deep learning for 
educational data. Deep learning is a machine learning approach 
using neural networks with multiple levels of representational 
transformation (i.e., hidden layers). Deep learning has been used 
in a variety of domains over the past five years with impressive 
results. Recently, it has been used for educational data sets with
mixed results when compared to traditional modeling
methodologies.
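

To make the phrase "multiple levels of representational transformation" concrete, here is a small forward pass through a network with two hidden layers, written in plain Python. The weights, the two input features, and the function names are hypothetical; a real model would learn its weights from data and would normally use a deep learning toolkit.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: each row of `weights` feeds one output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def sigmoid(value):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def predict(features, layers):
    """Apply each hidden layer in turn, then a single sigmoid output unit."""
    *hidden, (out_weights, out_biases) = layers
    activations = features
    for weights, biases in hidden:
        activations = relu(dense(activations, weights, biases))
    return sigmoid(dense(activations, out_weights, out_biases)[0])


# Hypothetical hand-picked weights; a trained model would learn these.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),  # 2 inputs -> 3 units
    ([[0.6, -0.1, 0.3], [-0.4, 0.2, 0.5]], [0.05, 0.0]),          # 3 units -> 2 units
    ([[1.2, -0.7]], [0.1]),                                       # 2 units -> 1 output
]

# Two invented input features describing a student's recent activity.
p_correct = predict([1.0, 0.0], layers)
print(p_correct)
```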


We are interested in work on a variety of topics with deep 
learning: new prediction and modeling problems, best practices for 
featurizing data, network architectures, approaches to pre-training 
and whether it is necessary, interpreting the learned models, end- 
to-end deep learning approaches with low-level non-symbolic data, 
toolkits people have developed, and empirical results on known
problems to help the field develop best practices. The workshop is
also interested in negative results such as analyses of data sets and
domains where deep learning fails to achieve state-of-the-art
performance.


2. GOALS OF WORKSHOP 


The primary goal of this workshop is to provide a venue for 
researchers to present emerging work. There is not much prior art 
on applying deep learning to educational data, and it is unclear even
what the scope of possible applications is: although most work
has focused on student modeling, some work has focused on using 
deep learning to assist in scoring essays. Having a discussion 
about possible application areas will be productive. 


In addition, this workshop will focus on recent major topics in deep
learning for educational data. A paper published in 2016, “How
Deep is Knowledge Tracing?”, questions the need for deep models and
will be discussed at the workshop.


Finally, this workshop will provide researchers on deep learning for 
EDM a chance to get focused feedback on their work. Ensuring 
that the research is critiqued by a roomful of people interested in 
the topic is more useful to the presenters (and the community) than 
counting on haphazard interactions at the conference. 
Ran Liu
ABSTRACT


This workshop will explore LearnSphere, an NSF-funded, 
community-based repository that facilitates sharing of educational 
data and analytic methods. The workshop organizers will discuss 
the unique research benefits that LearnSphere affords. In 
particular, we will focus on Tigris, a workflow tool within 
LearnSphere that helps researchers share analytic methods and 
computational models. Authors of accepted workshop papers will 
integrate their analytic methods or models into LearnSphere’s 
Tigris in advance of the workshop, and these methods will be 
made accessible to all workshop attendees. We will learn about 
these different analytic methods during the workshop and spend 
hands-on time applying them to a variety of educational datasets 
available in LearnSphere’s DataShop. Finally, we will discuss the 
bottlenecks that remain, and brainstorm potential solutions, in 
openly sharing analytic methods through a central infrastructure 
like LearnSphere. Our ultimate goal is to create the building
blocks that allow groups of researchers to integrate their data with
that of other researchers in order to advance the learning sciences,
as harnessing and sharing big data has done for other fields.


Keywords 


Learning metrics; data storage and sharing; data-informed 
learning theories; modeling; data-informed efforts; scalability. 


1. INTRODUCTION 


Owing to a boom of interest in both educational
technology and the use of data to improve student learning,
student learning activities and progress are increasingly being
tracked and stored. There is a large variety in the kinds, density,
and volume of such data and in the analytic and adaptive learning
methods that take advantage of it. Data can range from simple
(e.g., clicks on menu items or structured symbolic expressions) to 
complex and harder-to-interpret (e.g., free-form essays, discussion 
board dialogues, or affect sensor information). Another dimension 
of variation is the time scale in which observations of student 
behavior occur: click actions are observed within seconds in 
fluency-oriented math games or in vocabulary practice, problem- 
solving steps are observed every 20 seconds or so in modeling 
tool interfaces (e.g., spreadsheets, graphers, computer algebra) in 
intelligent tutoring systems for math and science, answers to 
comprehension-monitoring questions are given and learning 
resource choices are made every 15 minutes or so in massive open 
online courses (MOOCs), lesson completion is observed across 
days in learning management systems, chapter/unit test results are 
collected after weeks, end-of-course completion and exam scores 
are collected after many months, degree completion occurs across 
years, and long-term human goals like landing a job and achieving 
a good income occur across lifetimes. Different paradigms of 
data-driven education research differ both in the types of data they 
tend to use and in the time scale in which that data is collected. In 
fact, relative isolation within disciplinary silos is arguably 
fostered and fed by differences in the types and time scale of data
used [4, 5]. 


Thus, there is a broad need for an overarching data infrastructure
that not only supports sharing and use within each type of student
data (e.g., clickstream, MOOC, discourse, affect) but also supports
investigations that bridge across them. This will enable the
research community to understand how and when long-term 
learning outcomes emerge as a causal consequence of real-time 
student interactions within the complex set of instructional options 
available [2]. Such an infrastructure will support novel, 
transformative, and multidisciplinary approaches to the use of 
data to create actionable knowledge to improve learning 
environments for STEM and other areas in the medium term and 
will revolutionize learning in the longer term. 


LearnSphere transforms scientific discovery and innovation in 
education through a scalable data infrastructure designed to enable 
educators, learning scientists, and researchers to easily collaborate 
over shared data using the latest tools and technologies.
LearnSphere.org provides a hub that integrates across existing 
data silos implemented at different universities, including 
educational technology “click stream” data in CMU’s DataShop, 
massive online course data in Stanford’s DataStage and analytics 
in MIT’s MOOCdb, and educational language and discourse data
in CMU’s new DiscourseDB. LearnSphere integrates these DIBBs 
in two key ways: 1) with a web-based portal that points to these 
and other learning analytic resources and 2) with a web-based 
workflow authoring and sharing tool called Tigris. A major goal is 
to make it easier for researchers, course developers, and 
instructors to engage in learning analytics and educational data 
mining without programming skills. 


2. SPECIFIC WORKSHOP OBJECTIVES 


Broadly, this workshop offers the EDM community
exposure to LearnSphere as a community-based infrastructure for
educational data and analysis tools. In opening lectures, the 
organizers will discuss the way LearnSphere connects data silos 
across universities and its unique capabilities for sharing data, 
models, analysis workflows, and visualizations while maintaining 
confidentiality. 


More specifically, we propose to focus on attracting, integrating, 
and discussing researcher contributions to Tigris, the web-based 
workflow authoring and sharing tool. The goal of Tigris is to 
support any custom analysis method that can be applied to the 
datasets and to produce outputs in a standardized way that 
facilitates both quantitative and qualitative model comparisons. 
This workflow feature allows researchers to apply their own 
analysis methods to the vast array of datasets available in the 
educational data repository. It affords researchers the advantages 
of (1) using the built-in learning curve visualizations on the 
outputs of their own analysis workflows, (2) easily comparing 
their results both quantitatively and graphically to the outputs of 
any other analysis methods that are currently in LearnSphere (e.g.,
Bayesian Knowledge Tracing [1], Performance Factors Analysis 
[6], MOOC activity analysis [3], and others) or that have been 
uploaded to LearnSphere as a custom workflow, and (3) sharing 
their own analysis workflows with the community of researchers. 
Without any prior programming experience, researchers can use 
LearnSphere’s drag-and-drop interface to compare, across 
alternative analysis methods and across many different datasets, 
model fit metrics like AIC, BIC, and cross validation as well as 
parameter estimates themselves. 
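

For reference, AIC and BIC can be computed from a model's maximized log-likelihood as AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of parameters and n the number of observations. The sketch below applies these standard formulas to two hypothetical model fits; it is not Tigris code, and the model names and log-likelihood values are invented for illustration.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fits of two models to the same 1000 observations.
n_obs = 1000
fits = {
    "model_a": {"log_likelihood": -520.0, "n_params": 4},
    "model_b": {"log_likelihood": -515.0, "n_params": 12},
}

scores = {
    name: {
        "aic": aic(fit["log_likelihood"], fit["n_params"]),
        "bic": bic(fit["log_likelihood"], fit["n_params"], n_obs),
    }
    for name, fit in fits.items()
}

# The preferred model under each criterion is the one with the lowest score.
best_by_aic = min(scores, key=lambda name: scores[name]["aic"])
best_by_bic = min(scores, key=lambda name: scores[name]["bic"])
print(scores, best_by_aic, best_by_bic)
```

In this invented case the second model fits slightly better but uses three times as many parameters, so both criteria prefer the simpler one; BIC penalizes the extra parameters more heavily because its penalty grows with the number of observations.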


Workshop submissions will involve a brief description of an 
analysis pipeline relevant to modeling educational data as well as 
accompanying code. Prior to the workshop itself, the organizers 
will coordinate with authors of accepted submissions to integrate 
their code into Tigris. A significant portion of the workshop will 
be dedicated to hands-on exploration of custom workflows and 
workflow modules within Tigris. Authors of accepted submissions 
will present their analysis pipelines, and everyone attending the
workshop will be able to access those analysis pipelines within
Tigris and apply them to a variety of freely available educational
datasets from LearnSphere. The end goal is to generate, for each
workflow component contribution in the workshop, a publishable 
workshop paper that describes the outcomes of openly sharing the 
analysis with the research community. 


Finally, workshop attendees will discuss bottlenecks that remain 
toward our goal of an easier, more open way to share analytic 
tools. We will also brainstorm possible solutions. Our goal is to
create the building blocks that allow groups of researchers to
integrate their data with that of other researchers so that we can
advance the learning sciences, as harnessing and sharing big data
and analytics has done for other fields.